Zaloni develops a backend solution architecture to speed up data processing and extraction.
A leading provider of consumer information, predictive analytics and business intelligence to 95% of the Fortune 500 consumer packaged goods (CPG) and retail companies needed a more cost-effective and efficient way to process, analyze and manage data. The company manages massive volumes of data from disparate sources, ingesting hundreds of gigabytes from external data providers every week, and its ongoing goal is to keep technology costs as low as possible while providing clients with state-of-the-art analytic and intelligence tools. Although it has always been at the forefront of big data, before adopting Hadoop it relied exclusively on mainframe and warehouse technologies.
Throughput Limitations and Ballooning Costs
The company needed to build a foundation for a more cost-effective, flexible and expandable data processing and storage environment. Additionally, it needed a new big data platform that would solve some of its largest technical challenges. These included:
- Reducing mainframe load
- Reducing mainframe support risk
- Offloading significant amounts of data from the mainframe and data warehouses to Hadoop
The company used a mainframe for point-of-sale (POS) data, which was costly to maintain and use and limited in throughput. Zaloni partnered with the company to create a system running on MapR’s M5 version of Hadoop to address the following technical challenges:
- Atomic view: A custom multi-version concurrency control mechanism was built using ZooKeeper, allowing continuous extraction to run concurrently with jobs that update the underlying dataset while preventing users who pull data during an update from seeing a corrupted view.
- Sustainable high extract throughput: Extract requests were batched to optimize read operations across the cluster and maximize input/output (I/O) server virtualization. The solution achieved 1 to 5 million records per second, exceeding the mainframe’s peak rates.
- Generic byte-level sorting: The new system allowed sorting on any field in any order; a UPC-based sorting mechanism also provided sorting at the product level.
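The atomic view described in the first bullet can be illustrated with a minimal sketch of multi-version concurrency control. This is not Zaloni's implementation: the `VersionedStore` class, its method names, and the sample keys are hypothetical, and an in-process lock stands in for the coordination that the case study says was built on ZooKeeper. The idea shown is only that a reader pins one version for the length of an extract, while a writer stages the next version off to the side and publishes it in a single step.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory stand-in for a versioned dataset."""

    def __init__(self, initial):
        self._versions = {0: dict(initial)}  # version id -> snapshot
        self._current = 0                    # the version new readers will pin
        self._write_lock = threading.Lock()  # assumption: one writer at a time

    def begin_read(self):
        # A reader pins whichever version is current when its extract starts.
        return self._current

    def read(self, version, key):
        # Every read in an extract goes to the pinned version, never to a
        # version that is still being staged.
        return self._versions[version].get(key)

    def commit_update(self, changes):
        with self._write_lock:
            # Stage the new version as a copy so pinned readers are untouched.
            staged = dict(self._versions[self._current])
            staged.update(changes)
            new_version = self._current + 1
            self._versions[new_version] = staged
            # Publish only after the new version is complete.
            self._current = new_version
        return new_version


store = VersionedStore({"upc-1": 10, "upc-2": 20})

pinned = store.begin_read()                      # extract starts on version 0
store.commit_update({"upc-1": 11, "upc-2": 21})  # update lands mid-extract
old_view = (store.read(pinned, "upc-1"), store.read(pinned, "upc-2"))

later = store.begin_read()                       # a new extract sees version 1
new_view = (store.read(later, "upc-1"), store.read(later, "upc-2"))
```

The running extract sees either all of the old values or, if it starts later, all of the new ones, never a mix. A real system would also need to retire old versions once no reader has them pinned, which this sketch leaves out.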
The company’s Warehouse team works with tens of terabytes of fact and dimension data across several individual data warehouses. Extractions from the warehouses feed a large, downstream, client-facing reporting farm. The existing system (a major data warehouse solution) was being pushed beyond its capabilities, barely meeting SLAs, and the cost of adding instances was prohibitive.
Before working with Zaloni, the company had only looked at traditional warehouse technologies. After meeting with Zaloni, the company realized that housing the aggregate POS dataset in Hadoop, and servicing from there the extractions that populate its custom, in-memory analytics farm, would be both significantly more cost-effective and faster. In other words, the company would be able to offload more data, faster, and realize substantial savings in a short time.
Ingestion, Orchestration and Transformation
Expanding on the POS Warehouse project, Zaloni worked to create a flexible solution that would support other warehouses, ease the onboarding of new data providers, and provide a framework for leveraging the compute power of Hadoop for analytics.
A New Architecture for Automated Data Ingestion and Extraction
Working with the company’s IT team, Zaloni designed and built the backend solution architecture for the offload solution based on Zaloni’s Bedrock data management platform.
Using automated, flexible schema registration for the new warehouses, automatic data ingestion, and high-throughput extraction via a metadata-driven process, Zaloni’s solution completed the offload to Hadoop in half the time a homegrown solution would have taken. Development took just under six months, followed by two months of final testing, performance fine-tuning and running the solution in parallel with regular production processes.
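To make "metadata-driven" concrete, here is a minimal sketch of schema registration driving ingestion. It does not reflect Bedrock's actual API, which the case study does not describe: the `register_schema` and `ingest_line` functions, the `pos_sales` dataset name, its fields, and the pipe delimiter are all assumptions chosen for illustration. The point is only that onboarding a new dataset means registering metadata rather than writing new parsing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    cast: type  # callable that converts the raw string, e.g. int or str


# Hypothetical registry: dataset name -> (fields, delimiter)
SCHEMAS = {}


def register_schema(dataset, fields, delimiter="|"):
    """Record the layout of a dataset so ingestion needs no custom code."""
    SCHEMAS[dataset] = (tuple(fields), delimiter)


def ingest_line(dataset, line):
    """Parse one delimited line using only the registered metadata."""
    fields, delimiter = SCHEMAS[dataset]
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(fields):
        raise ValueError(
            f"{dataset}: expected {len(fields)} fields, got {len(values)}"
        )
    return {f.name: f.cast(v) for f, v in zip(fields, values)}


# Onboarding a (made-up) provider feed is a registration call, not new code.
register_schema(
    "pos_sales",
    [Field("upc", str), Field("store_id", int), Field("units", int)],
)

record = ingest_line("pos_sales", "012345678905|42|3\n")
```

Keeping the UPC as a string preserves its leading zero, while the other fields are cast to integers according to the registered schema.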
In addition, the new platform enabled the IT team to efficiently manage and track various types of data (beyond POS data) from multiple countries and multiple environments (e.g., “test” and “prod”).
- $5.2 million annual savings; projected additional savings of $4.4 million (upon completion of the fine-grained data-at-rest modification project)
- Reduced mainframe MIPS (millions of instructions per second, a measure of processing power and CPU consumption) by nearly half, while providing better throughput than using Hadoop alone; the fine-grained data-at-rest modification project targets complete sunset of the mainframe
- Achieved throughput rate of over 1 million records per second, on a 16-node cluster