Applied Data Lakes: Building a 360° View of Your Customer

May 30, 2017 Greg Wood

Data lakes are a promising, exciting, and, unfortunately, often intangible concept. While it sounds great to have a single repository for all of your files, regardless of format, what does that look like in an actual deployment? In this and subsequent posts, we’ll put some concrete detail behind the idea of the data lake. This edition focuses on building a unified, 360-degree view of the business from the customer data in a data lake.

To set up the discussion, let’s look at an adaptation of the standard data lake architecture, tailored to this use case.

 

[Diagram: Customer 360 data lake architecture]

 

This diagram illustrates a common problem: organizations have a wide array of data sources and source formats, all of which eventually need to be merged, indexed, or leveraged by third-party tools. This is the sweet spot for data lakes, since their central tenet is to store all data regardless of format; the challenge is organizing and maintaining all of that data.

In our scenario, we have three main sources of data:

  • Traditional file-based data coming from local systems and HDFS
  • RDBMS repositories such as MySQL or Oracle
  • Streaming data, such as Kafka topics

Specifically, the sources in the above diagram include delimited sources (shown in blue) that provide sales transaction and customer feedback data, RDBMS sources (shown in green) containing customer details (address, name, age, etc.) and inventory details, XML logs (in red) collected from web servers, and Avro or Parquet data (in orange) collected from a clickstream tracker.
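
To make that variety concrete, here is a minimal PySpark sketch of reading these kinds of sources into a cluster. The paths, connection settings, and topic name are hypothetical, and the Avro and XML sources would need their respective Spark packages (spark-avro, spark-xml), so only the delimited, JDBC, Parquet, and Kafka reads are shown. The sales, feedback, customers, and clicks DataFrames are reused in the later sketches.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("customer-360-ingest").getOrCreate()

    # Delimited files (sales transactions, customer feedback) -- hypothetical paths
    sales = spark.read.option("header", "true").csv("s3://lake/raw/sales/")
    feedback = spark.read.option("header", "true").csv("s3://lake/raw/feedback/")

    # RDBMS source (customer details) via JDBC -- hypothetical connection settings
    customers = (spark.read.format("jdbc")
                 .option("url", "jdbc:mysql://db-host:3306/crm")
                 .option("dbtable", "customers")
                 .option("user", "reader")
                 .option("password", "secret")
                 .load())

    # Columnar clickstream data (Parquet shown; Avro would use the spark-avro package)
    clicks = spark.read.parquet("s3://lake/raw/clickstream/")

    # Streaming source -- a Kafka topic, read with the Spark Kafka connector
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "web-events")
              .load())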

Initially, we may not know much about the schema, content, or structure of the incoming data beyond the filename, table name, or topic name. This isn’t a problem for the general concept of the data lake, since we don’t need to know the schema in order to write the data, but it quickly becomes an issue once we want to use that data for a specific purpose.

Ingestion + Metadata 

[Diagram: Raw Zone data architecture]

This is where metadata-based ingestion becomes valuable. As data enters the Raw Zone, it is tagged with metadata and registered to a specific source, or “entity”. This tagging can happen in several ways: it might be based on a predefined schema, on context derived directly from the data, or on metadata pulled from the source system. Regardless of how the metadata is derived, this layer of context allows all of the data within the lake to be organized efficiently, and it also lets us start processing that data.
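
Zaloni’s platform captures this metadata as part of managed ingestion; purely as an illustration of the idea, the sketch below builds the kind of record an ingestion step might register for each file landing in the Raw Zone. The register_raw_entity helper and its field names are hypothetical.

    import hashlib
    import json
    import os
    from datetime import datetime, timezone

    def register_raw_entity(path, entity, source_system, fmt, schema=None):
        """Build a metadata record for a file landing in the Raw Zone.

        Purely illustrative: a real platform would persist this to its
        catalog rather than return a plain dict.
        """
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return {
            "entity": entity,                  # logical source, e.g. "customer_feedback"
            "source_system": source_system,    # e.g. "hdfs", "mysql", "kafka"
            "format": fmt,                     # e.g. "csv", "xml", "parquet"
            "location": path,
            "size_bytes": os.path.getsize(path),
            "checksum_sha256": digest,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "schema": schema,                  # may stay None until profiling runs
        }

    # Example: tag one feedback file as it enters the Raw Zone.
    record = register_raw_entity("raw/feedback/2017-05-30.csv",
                                 entity="customer_feedback",
                                 source_system="hdfs", fmt="csv")
    print(json.dumps(record, indent=2))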


In the context of the diagram, we are using a cloud-based EMR cluster to perform various actions, such as data quality checks, watermarking, format conversion, tokenization, and masking. Realistically, a good data lake should be agnostic to the underlying compute and storage infrastructure. These operations feed what we call the Trusted Zone, populated by cleansed, normalized, and regulation-compliant data.

[Diagram: Trusted Zone data lake architecture]

Specifically, our diagram shows some malformed or invalid records in the customer feedback data, represented by the red lines in that source. Based on predefined data quality rules, those records need to be kept out of the Trusted Zone. They can then be cleansed, tagged for follow-up, or discarded altogether (since the raw data is preserved in the Raw Zone); the specific process depends on business needs.
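
As an illustration (not Zaloni’s actual rule engine), a data quality rule for the feedback entity might look like the PySpark sketch below. It assumes the feedback DataFrame from the earlier ingestion sketch and hypothetical customer_id and rating columns.

    from pyspark.sql import functions as F

    # Hypothetical rule: a feedback record is valid only if it has a customer id
    # and a rating in the expected 1-5 range.
    is_valid = F.col("customer_id").isNotNull() & F.col("rating").between(1, 5)

    valid = feedback.filter(is_valid)
    invalid = feedback.filter(~is_valid)

    # Promote clean records to the Trusted Zone; quarantine the rest for follow-up.
    valid.write.mode("overwrite").parquet("s3://lake/trusted/customer_feedback/")
    invalid.write.mode("overwrite").parquet("s3://lake/quarantine/customer_feedback/")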

We’ve also got sources that may contain sensitive or PII data, such as credit card numbers (in the sales transaction logs) or social security numbers (in the customer database). These fields need to be masked or tokenized before being moved into the Trusted Zone, for example by hashing them with SHA-256 or applying another method. It’s also possible to parse or normalize specific data along the way; in this scenario, parsing the XML web log data makes it easier to consume downstream. We may also have sources, such as the inventory and clickstream data, that can pass directly from the Raw Zone to the Trusted Zone with no processing at all.
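
Continuing the same illustrative PySpark approach, the sketch below hashes hypothetical credit_card_number and ssn columns with SHA-256 before the sales and customers DataFrames are written to the Trusted Zone. A production deployment might tokenize instead, so that authorized users can recover the original values.

    from pyspark.sql import functions as F

    # One-way mask: replace raw PII values with their SHA-256 digests before the
    # data is written to the Trusted Zone. Column names are hypothetical.
    masked_sales = sales.withColumn(
        "credit_card_number", F.sha2(F.col("credit_card_number").cast("string"), 256))

    masked_customers = customers.withColumn(
        "ssn", F.sha2(F.col("ssn").cast("string"), 256))

    masked_sales.write.mode("overwrite").parquet("s3://lake/trusted/sales/")
    masked_customers.write.mode("overwrite").parquet("s3://lake/trusted/customers/")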

After data lands in the Trusted Zone, it’s available for consumption by downstream BI tools or for further processing. Since all of this data relates to customers, we can create a unified, 360-degree view of the business by merging it into a single table, for example using Spark. This gives us one table containing aspects such as purchases, web clicks, and feedback ratings, which can then be further analyzed and processed by BI tools or by a data lake-native tool such as the Zaloni Data Platform.
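
Tying the earlier sketches together, a simplified version of that merge might look like the following, joining the Trusted Zone entities on a hypothetical shared customer_id key. Each downstream tool then queries the single customer_360 table instead of stitching the sources together itself.

    from pyspark.sql import functions as F

    # Aggregate each source down to one row per customer (hypothetical column names).
    purchases = masked_sales.groupBy("customer_id").agg(
        F.count("*").alias("purchase_count"),
        F.sum("amount").alias("lifetime_spend"))

    page_views = clicks.groupBy("customer_id").agg(
        F.count("*").alias("page_views"))

    ratings = valid.groupBy("customer_id").agg(
        F.avg("rating").alias("avg_feedback_rating"))

    # One wide table: customer details plus behavior from every other source.
    customer_360 = (masked_customers
                    .join(purchases, "customer_id", "left")
                    .join(page_views, "customer_id", "left")
                    .join(ratings, "customer_id", "left"))

    customer_360.write.mode("overwrite").parquet("s3://lake/trusted/customer_360/")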

Stay tuned for more posts focusing on topics such as data ingestion, handling tricky unstructured formats (PDFs, XLS, etc.), regulation and audit compliance (GDPR, RDARR, etc.), and other areas well-suited to the data lake.

Learn more about Zaloni's data lake solution for Customer 360.

 


 
