The Executive Guide to Data Warehouse Augmentation

January 19, 2017 Rajesh Nadipalli

The traditional data warehouse (DW) is constrained in both storage capacity and processing power. That’s why the overall footprint of the data warehouse is shrinking as companies look for more efficient ways to store and process big data. Although many companies still use data warehouses effectively for complex data analytics, creating a hybrid architecture that migrates storage and large-scale or batch processing to a data lake enables them to save on storage and processing costs and get more value from the data warehouse for business intelligence activities.

Designing Your Data Lake Architecture

Getting started with augmenting a traditional data warehouse can be a difficult first step. The video below explains the traditional DW architecture, its pain points, and the modern data lake architecture.

Offloading storage as well as extract, transform and load (ETL) functions to a scale-out architecture such as Hadoop enables enterprises to focus the DW on what it does best: business intelligence (BI). Data can be served directly to BI and analytical tools that understand Hadoop, or the DW can augment the data lake to support legacy tools if needed. Consumers include BI tools that connect to the DW over JDBC/ODBC as well as statistical tools such as R and SAS.

In the traditional flow, data from the various source systems is sent to an ETL tool, where it is cleansed (for example, removing bad records), standardized (for example, normalizing date formats), transformed (for example, applying lookups), joined into facts and dimensions, and finally loaded into the DW.
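
To make that flow concrete, here is a minimal PySpark sketch of the same cleanse, standardize, transform, join and load steps. The paths, column names and JDBC connection details are hypothetical placeholders rather than a reference implementation.

```python
# A minimal PySpark sketch of the ETL flow above; paths, column names and
# the JDBC target are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dw-etl-sketch").getOrCreate()

# Extract: read raw orders from a landing area (path is an assumption)
orders_raw = spark.read.option("header", True).csv("/landing/orders/")

# Cleanse: drop bad records, e.g. rows missing a key or an amount
orders_clean = orders_raw.dropna(subset=["order_id", "customer_id", "amount"])

# Standardize: normalize date formats and currency casing
orders_std = (orders_clean
              .withColumn("order_date", F.to_date("order_date", "MM/dd/yyyy"))
              .withColumn("currency", F.upper("currency")))

# Transform: look up customer attributes and build the fact table
dim_customer = spark.read.parquet("/warehouse/dim_customer/")
fact_orders = orders_std.join(dim_customer, on="customer_id", how="left")

# Load: write the fact table to the DW over JDBC (connection details are placeholders)
(fact_orders.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dw-host:5432/dw")
    .option("dbtable", "fact_orders")
    .option("user", "etl_user")
    .option("password", "change_me")
    .mode("append")
    .save())
```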

Building Your Data Lake

Now that you know the importance of having the right architecture, think about three key components to building a modern data lake: the hydrator, the transformer and the provisioner. Here’s why you need them and how they can help you make sense of your architecture:

Hydrator - Typically implemented through Bedrock managed ingestion, this is the architectural component that brings data from the various source systems into the data lake (a brief ingestion sketch follows the list below).

To build this component, you will need:

  • Source Connection manager
    • Source Type, Credentials, Owner
  • Data Feed Configuration
    • Feed Name, Type (RDBMS/File/Streaming)
    • Mode - Incremental/Full/CDC
    • Expected Latency
    • Structure information, PII
  • Reusable Scripts / Workflows on top of core components
    • Hadoop APIs for files
    • Sqoop for RDBMS
    • Kafka, Flume for streaming
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
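
As an illustration of the hydrator idea, the sketch below pairs a simple feed definition with the Sqoop incremental import it could drive. The field names are illustrative only and are not Bedrock's actual metadata model; the connection string, table and paths are assumptions.

```python
# A minimal sketch of a managed-ingestion feed definition and the Sqoop
# command it could drive; field names, connections and paths are illustrative.
import subprocess

feed = {
    "name": "crm_customers",
    "type": "RDBMS",                      # RDBMS / File / Streaming
    "mode": "incremental",                # incremental / full / CDC
    "source": {
        "connect": "jdbc:oracle:thin:@crm-db:1521/CRM",
        "username": "ingest_user",
        "table": "CUSTOMERS",
    },
    "check_column": "LAST_MODIFIED",      # drives incremental loads
    "last_value": "2017-01-18 00:00:00",  # tracked per run by the hydrator
    "target_dir": "/data/raw/crm/customers",
}

# Translate the feed definition into a Sqoop incremental import
cmd = [
    "sqoop", "import",
    "--connect", feed["source"]["connect"],
    "--username", feed["source"]["username"],
    "--table", feed["source"]["table"],
    "--incremental", "lastmodified",
    "--check-column", feed["check_column"],
    "--last-value", feed["last_value"],
    "--target-dir", feed["target_dir"],
]

result = subprocess.run(cmd, capture_output=True, text=True)
# Operational stats: capture the outcome for failure notifications and SLA monitoring
print(feed["name"], result.returncode)
```

The same pattern extends to file and streaming feeds by swapping the Sqoop call for Hadoop file APIs or a Kafka/Flume pipeline.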

Transformer - Consider this component the next-generation ETL: it cleans bad data, correlates datasets and creates enriched insights from raw data (a brief transformation sketch follows the list below).

To build this component, you will need:

  • Application Development Platform
    • Built on Hadoop ecosystem components such as Spark, MapReduce, Pig and Hive
    • Abstract and build reusable workflows for common problems
  • Business Rules Integration
    • The application platform should be able to integrate easily with rules provided by the business
    • For example, an insurance company might have several rules for computing policy discounts. The business should be able to change those rules without IT involvement
  • Workflow Scheduling / Management
    • Workflows should have scheduling, dependency management and logging
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
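
As a sketch of the transformer and the business-rules point above, the following PySpark example keeps hypothetical policy-discount rules in a small table outside the pipeline code, so the business can adjust them without changing the job itself. Column names, thresholds and paths are assumptions.

```python
# A minimal PySpark sketch of the transformer: the policy-discount rules live
# outside the pipeline logic (here, a simple list the business could edit).
# Column names, thresholds and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformer-sketch").getOrCreate()

# Business-maintained rules: (condition expression, discount percent)
discount_rules = [
    ("years_with_company >= 10", 15),
    ("claims_last_3_years = 0", 10),
    ("bundled_policies >= 2", 5),
]

policies = spark.read.parquet("/data/refined/policies/")

# Apply the rules as an additive expression built from the rule table
discount_expr = F.lit(0)
for condition, pct in discount_rules:
    discount_expr = discount_expr + F.when(F.expr(condition), F.lit(pct)).otherwise(F.lit(0))

enriched = policies.withColumn("discount_pct", discount_expr)
enriched.write.mode("overwrite").parquet("/data/trusted/policies_enriched/")
```

In practice the rules would live in a rules engine or a governed configuration store rather than in the script, but the separation of business rules from pipeline code is the point being illustrated.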

Provisioner - This component extracts data from the data lake and provides it to the consumers (a brief provisioning sketch follows the list below).

Below are key design aspects of this component:

  • Destination Connection manager
    • Destination Type, Credentials, Owner
  • Provisioning Metadata
    • Type (RDBMS/File/Streaming)
    • Filters, if applicable
    • Mode - Full/Incremental
    • Frequency: daily / hourly / per message
  • Reusable Scripts/Workflows on top of core components
    • Hadoop APIs for files
    • Sqoop for RDBMS
    • Kafka, Flume for streaming
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
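
To illustrate the provisioner, here is a minimal sketch of a destination definition and the Sqoop export it could drive to push a trusted data set into an RDBMS consumer. The field names, connection string and paths are illustrative assumptions.

```python
# A minimal sketch of the provisioner: a destination definition plus the
# Sqoop export it could drive. Field names, connections and paths are illustrative.
import subprocess

destination = {
    "name": "marketing_datamart",
    "type": "RDBMS",                       # RDBMS / File / Streaming
    "mode": "full",                        # full / incremental
    "frequency": "daily",
    "connect": "jdbc:postgresql://mart-db:5432/marketing",
    "username": "provision_user",
    "table": "customer_360",
    "export_dir": "/data/trusted/customer_360/",
}

cmd = [
    "sqoop", "export",
    "--connect", destination["connect"],
    "--username", destination["username"],
    "--table", destination["table"],
    "--export-dir", destination["export_dir"],
]

result = subprocess.run(cmd, capture_output=True, text=True)
# Operational stats: record what ran, when, and whether it succeeded
print(destination["name"], result.returncode)
```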

Benefits of Data Warehouse Augmentation

Building a modern data lake architecture has a long list of benefits. Zaloni works closely with enterprises to design the architecture for DW offload and implements Bedrock - the industry’s only fully integrated Hadoop data management platform - to not only accelerate deployment, but also significantly improve visibility into the data. Learn how to save millions and enable faster time to insight by downloading our DW Augmentation solution brief.

If you have questions or concerns about your DW architecture, reach out to us.

About the Author

Rajesh Nadipalli

Director of Product Support and Professional Services
