Zaloni Zip: Building a Modern Data Lake Architecture Pt. 2

December 13, 2016 Rajesh Nadipalli

In the last video, we looked at the pain points of traditional data warehouse architecture and the high-level architecture of a next-generation data lake based on Hadoop.

In this video, I will discuss the key components you need to build a new architecture.

Hydrator - Typically done through Bedrock managed ingestion, this is the architectural component that brings data from various source systems into the data lake.

To build this component, you will need the following (a minimal ingestion sketch follows the list):

  • Source Connection manager
    • Source Type, Credentials, Owner
  • Data Feed Configuration
    • Feed Name, Type (RDBMS/File/Streaming)
    • Mode - Incremental/Full/CDC
    • Expected Latency
    • Structure information, PII
  • Reusable Scripts / Workflows on top of core components
    • Hadoop APIs for files
    • Sqoop for RDBMS
    • Kafka, Flume for streaming
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
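
To make this concrete, below is a minimal Python sketch of how a feed definition could drive the choice of core component: Sqoop for RDBMS sources, the HDFS file commands for files, and Kafka for streaming. The DataFeed class, its fields, and the example "orders" feed are illustrative assumptions, not Bedrock's managed-ingestion API.

```python
# A hedged sketch of a Hydrator feed definition; names and command templates
# are illustrative assumptions, not the Bedrock API.
from dataclasses import dataclass

@dataclass
class DataFeed:
    name: str
    feed_type: str          # "RDBMS", "File", or "Streaming"
    mode: str               # "Incremental", "Full", or "CDC"
    source_uri: str         # JDBC URL, file path, or broker address
    target_dir: str         # landing zone path in the data lake
    contains_pii: bool = False

def ingest_command(feed: DataFeed) -> list:
    """Pick the core ingestion component based on the feed type."""
    if feed.feed_type == "RDBMS":
        cmd = ["sqoop", "import", "--connect", feed.source_uri,
               "--table", feed.name, "--target-dir", feed.target_dir]
        if feed.mode == "Incremental":
            # "updated_at" is a hypothetical check column for incremental loads
            cmd += ["--incremental", "append", "--check-column", "updated_at"]
        return cmd
    if feed.feed_type == "File":
        return ["hdfs", "dfs", "-put", feed.source_uri, feed.target_dir]
    # Streaming feeds are long-running; a console consumer stands in here.
    return ["kafka-console-consumer", "--topic", feed.name,
            "--bootstrap-server", feed.source_uri]

feed = DataFeed(name="orders", feed_type="RDBMS", mode="Incremental",
                source_uri="jdbc:mysql://dbhost/sales",
                target_dir="/lake/raw/orders")
print(" ".join(ingest_command(feed)))
```

In a real Hydrator, the same feed metadata would also drive the operational stats listed above: each run would record what ran, who triggered it, when, and whether it met its SLA.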

Transformer - Consider this component the next-generation ETL: it cleans bad data, correlates sources, and creates enriched insights from raw data.

To build this component, you will need the following (a PySpark sketch follows the list):

  • Application Development Platform
    • Built on Hadoop components such as Spark, MapReduce, Pig, and Hive
    • Abstract and build reusable workflows for common problems
  • Business Rules Integration
    • The application platform should integrate easily with rules provided by the business
    • For example, an insurance company might have several rules for computing policy discounts; the business should be able to change these rules without IT involvement
  • Workflow Scheduling / Management
    • Workflows should have scheduling, dependency management and logging
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
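
As a rough illustration of the Transformer idea, the PySpark sketch below cleans raw policy records and applies discount rules read from a JSON file, so the business can change the rules without touching code. The paths, column names, and rule format are assumptions made for the example, not a Zaloni component.

```python
# A minimal PySpark sketch of the Transformer: clean raw data, then apply
# business rules kept in a config file. Paths, columns, and the rule format
# are illustrative assumptions.
import json
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("policy-enrichment").getOrCreate()

# Raw policies landed by the Hydrator (hypothetical path and schema).
policies = spark.read.parquet("/lake/raw/policies")

# Discount rules maintained by the business, e.g.:
# [{"condition": "years_claim_free >= 5", "discount": 0.10},
#  {"condition": "age >= 60 AND vehicle_type = 'sedan'", "discount": 0.15}]
with open("discount_rules.json") as f:
    rules = json.load(f)

# Clean obvious bad records, then fold the rules into a single column
# expression; later rules take precedence when more than one matches.
clean = policies.filter(F.col("premium") > 0).dropDuplicates(["policy_id"])
discount = F.lit(0.0)
for rule in rules:
    discount = F.when(F.expr(rule["condition"]), rule["discount"]).otherwise(discount)

enriched = (clean
            .withColumn("discount", discount)
            .withColumn("final_premium", F.col("premium") * (1 - F.col("discount"))))

enriched.write.mode("overwrite").parquet("/lake/trusted/policies_enriched")
```

Because the rules live in a plain JSON file, the business can add or change a discount condition and rerun the workflow without any code change, which is the point of the business rules integration above.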

Provisioner - This component extracts data from the data lake and delivers it to consumers.

Below are the key design aspects of this component (a provisioning sketch follows the list):

  • Destination Connection manager
    • Destination Type, Credentials, Owner
  • Provisioning Metadata
    • Type (RDBMS/File/Streaming)
    • Filters if applicable
    • Mode - Full/Incremental
    • Frequency - daily/hourly/per message
  • Reusable Scripts/Workflows on top of core components
    • Hadoop APIs for files
    • Sqoop for RDBMS
    • Kafka, Flume for streaming
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
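
Below is a hedged Python sketch of how provisioning metadata might drive the export step, mirroring the Hydrator example: Sqoop export for RDBMS destinations, HDFS file commands for file drops, and Kafka for streaming consumers. The ProvisioningJob fields, the target table name, and the commands are illustrative assumptions, not a product API.

```python
# A sketch of Provisioner metadata and the delivery command it would produce;
# field names, the "policy_mart" table, and the topic are hypothetical.
from dataclasses import dataclass

@dataclass
class ProvisioningJob:
    destination_type: str     # "RDBMS", "File", or "Streaming"
    destination_uri: str      # JDBC URL, file path, or broker address
    source_dir: str           # curated data set in the lake
    mode: str = "Full"        # or "Incremental"
    frequency: str = "daily"  # "daily", "hourly", or "per message"
    filter_expr: str = ""     # optional filter, if applicable

def export_command(job: ProvisioningJob) -> list:
    """Choose the delivery mechanism from the destination type."""
    if job.destination_type == "RDBMS":
        # "policy_mart" is a hypothetical target table in the consumer's database
        return ["sqoop", "export", "--connect", job.destination_uri,
                "--table", "policy_mart", "--export-dir", job.source_dir]
    if job.destination_type == "File":
        return ["hdfs", "dfs", "-get", job.source_dir, job.destination_uri]
    # Streaming consumers would subscribe to a topic fed from the lake.
    return ["kafka-console-producer", "--topic", "provisioned-events",
            "--bootstrap-server", job.destination_uri]

job = ProvisioningJob(destination_type="RDBMS",
                      destination_uri="jdbc:postgresql://dwhost/marts",
                      source_dir="/lake/trusted/policies_enriched")
print(" ".join(export_command(job)))
```

As with the Hydrator, each provisioning run would feed the operational stats above, so failures, notifications, and SLA breaches are tracked per destination.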

For more information on building a modern data lake architecture, view the infographic or get the white paper.

About the Author

Rajesh Nadipalli

Director of Product Support and Professional Services
