Zaloni Bedrock Delivers Cost and Time Savings for Global Consumer Insights Firm

January 26, 2016

Zaloni develops a backend solution architecture to reduce data processing and extraction time.

A leading provider of consumer information, predictive analytics and business intelligence to 95% of the Fortune 500 consumer packaged goods (CPG) and retail companies needed a more cost-effective and efficient approach to processing, analyzing and managing data. Tasked with managing massive volumes of data from disparate sources (ingesting hundreds of gigabytes from external data providers every week), the company’s ongoing goal is to keep technology costs as low as possible while providing clients with state-of-the-art analytics and intelligence tools. Although the company has always been at the forefront of big data, before adopting Hadoop it had relied exclusively on mainframe and warehouse technologies.

Challenges

Throughput Limitations and Ballooning Costs

The company needed to build a foundation for a more cost-effective, flexible and expandable data processing and storage environment. Additionally, it needed a new big data platform that would solve some of its largest technical challenges. These included:

  • Reducing mainframe load
  • Reducing mainframe support risk
  • Offloading significant amounts of data from the mainframe and data warehouses to Hadoop

Mainframe Offload

The company used a mainframe for point of sale (POS) data, which was costly to maintain and use, and limited in its throughput. Zaloni partnered with the company to create a system running on MapR’s M5 version of Hadoop to address the following technical challenges:

  • Atomic view: A custom multi-version concurrency control mechanism was built on ZooKeeper, allowing continuous extraction to run concurrently with jobs updating the underlying dataset while preventing users who pull data during an update from seeing a corrupted view (see the sketch following this list).
  • Sustainable high extract throughput: Extract requests were batched to optimize read operations across the cluster and maximize input/output (I/O) utilization. The solution achieved 1-5 million records per second, exceeding the mainframe’s peak rates.
  • Generic byte-level sorting: The new system allowed sorting on any field in any order; a UPC-based sorting mechanism also provided sorting at the product level.
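
The atomic-view approach can be illustrated with a minimal sketch. It assumes (hypothetically) that each update job writes a complete new snapshot of the dataset to its own directory and that a single ZooKeeper znode (here called /pos/currentVersion) stores the path of the latest committed snapshot; the class name and znode layout are illustrative, not the project's actual implementation. Readers resolve the pointer once at the start of an extract and work only against that snapshot, while writers flip the pointer atomically with a compare-and-set on the znode version.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/**
 * Hypothetical sketch of an atomic "current version" pointer kept in ZooKeeper.
 * Update jobs load a full new snapshot into a fresh directory, then flip this
 * pointer; extract jobs read the pointer once and scan only that snapshot, so
 * concurrent updates never corrupt the view they are reading.
 */
public class VersionPointer {

    private static final String POINTER_PATH = "/pos/currentVersion"; // assumed znode layout
    private final ZooKeeper zk;

    public VersionPointer(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Reader side: resolve the snapshot directory to use for this extract run. */
    public String resolveSnapshotDir() throws Exception {
        byte[] data = zk.getData(POINTER_PATH, false, new Stat());
        return new String(data, StandardCharsets.UTF_8); // e.g. "/data/pos/v42"
    }

    /** Writer side: publish a newly loaded snapshot directory. */
    public void publish(String newSnapshotDir) throws Exception {
        byte[] payload = newSnapshotDir.getBytes(StandardCharsets.UTF_8);
        Stat stat = zk.exists(POINTER_PATH, false);
        if (stat == null) {
            zk.create(POINTER_PATH, payload,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            // setData with the expected znode version fails if another writer
            // raced us, which keeps the pointer flip atomic.
            zk.setData(POINTER_PATH, payload, stat.getVersion());
        }
    }
}
```

Because old snapshot directories remain in place until in-flight extracts finish, readers that started before an update complete against a consistent version rather than a half-written one.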

Warehouse Offload

The company’s Warehouse team works with tens of terabytes of fact and dimension data, across several individual data warehouses. Extractions from the warehouses feed a large, downstream, client-facing reporting farm. The existing system (a major data warehouse solution) was being pushed beyond its capabilities—barely meeting SLAs—and the cost of adding additional instances was prohibitive.

Before working with Zaloni, the company had only considered traditional warehouse technologies. After meeting with Zaloni, the company realized that housing the aggregate POS dataset in Hadoop, and servicing from there the extractions that populate its custom in-memory analytics farm, would be both significantly more cost effective and faster. In other words, the company could offload more data sooner and realize substantial savings in a short period of time.

Ingestion, Orchestration and Transformation

Expanding on the POS Warehouse project, Zaloni worked on creating a flexible solution to support other warehouses, ease the onboarding of new data providers, and provide a framework for leveraging the compute power of Hadoop for analytics.

Solution

A New Architecture for Automated Data Ingestion and Extraction

Working with the company’s IT team, Zaloni designed and built the backend solution architecture for the offload solution based on Zaloni’s Bedrock data management platform.

Using automated, flexible schema registration for the new warehouses, automatic data ingestion, and high-throughput extraction driven by metadata (sketched below), Zaloni’s solution completed the offload to Hadoop in half the time a homegrown solution would have required. Development took just under six months, followed by two months of final testing, performance tuning and running the solution in parallel with regular production processes.
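As a rough illustration of the metadata-driven idea (these are not Bedrock’s actual interfaces; the descriptor fields, class names and file-matching rule below are hypothetical), each data provider or warehouse is registered once with its expected schema, file pattern and target location, and the ingestion process then routes arriving files by looking up that registration instead of relying on hard-coded, per-source jobs.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical descriptor captured when a new warehouse or data provider is registered. */
record SourceRegistration(String sourceId,          // e.g. "pos_weekly_us"
                          String filePattern,       // regex that the provider's file drops must match
                          List<String> schemaFields,// expected column names, in order
                          String targetHadoopPath) {// where ingested data lands in Hadoop
}

/** Sketch of a metadata-driven router: onboarding a new source means adding a registration, not new code. */
class IngestionRouter {

    private final Map<String, SourceRegistration> registry;

    IngestionRouter(Map<String, SourceRegistration> registry) {
        this.registry = registry;
    }

    /** Match an arriving file against the registered sources and return its ingestion target, if any. */
    Optional<SourceRegistration> route(Path landedFile) {
        String name = landedFile.getFileName().toString();
        return registry.values().stream()
                .filter(reg -> name.matches(reg.filePattern()))
                .findFirst();
    }
}
```

The same registry can drive extraction in reverse: an extract request references a source ID and field list, and the metadata supplies the schema and location needed to batch and serve it.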

In addition, the new platform enabled the IT team to efficiently manage and track various types of data (beyond POS data) from multiple countries and multiple environments (e.g., “test” and “prod”).

Benefits

  • $5.2 million in annual savings, with an additional $4.4 million projected upon completion of the fine-grained data-at-rest modification project
  • Reduced mainframe MIPS (millions of instructions per second, a measure of processing and CPU consumption) by nearly half while providing better throughput than Hadoop alone; the fine-grained data-at-rest modification project targets complete sunset of the mainframe
  • Achieved a throughput rate of over 1 million records per second on a 16-node cluster
