Data lakes are a promising, exciting, and, unfortunately, often intangible concept. While it sounds great to have a single repository for all of your files, regardless of format, what does that look like in an actual deployment? In this and subsequent blog posts, we’ll add some concrete detail to the idea of the data lake. This edition focuses on building a unified, 360-degree view of the business from customer data in a data lake.
To set up the discussion, let’s look at an adaptation of the standard data lake architecture, tailored to this use case.
This diagram illustrates a common problem: customers have a wide array of different sources and source formats, all of which eventually need to be merged, indexed, or leveraged by third-party tools. This is the sweet spot for data lakes, since the central tenet is to store all data regardless of format; the challenge is organizing and maintaining all of that data.
In our scenario, we have three main sources of data:
- Traditional file-based data coming from local systems and HDFS
- RDBMS repositories such as MySQL or Oracle
- Streaming data, such as Kafka topics
Specifically, the sources in the above diagram include delimited sources (shown in blue) that provide sales transaction and customer feedback data, RDBMS sources (shown in green) containing customer details (address, name, age, etc.) and inventory details, some XML logs (in red) collected from web servers, and Avro or Parquet data (in orange) collected from a clickstream tracker.
Initially, we may not know much about the schema, content, or structure of the incoming data besides the filename, table name, or topic name. This isn’t much of a problem for the general concept of the data lake since we don’t need to know schema in order to write data, but it quickly becomes an issue once we want to use that data for some specific purpose.
Ingestion + Metadata
This is where metadata-based ingestion becomes valuable. As data enters the Raw Zone, it is tagged with metadata and registered to a specific source or “entity”. This tagging can occur in multiple ways: it might be based on a predefined schema, on context derived directly from the data, or on metadata pulled from the source system. Regardless of how the metadata is derived, this layer of context allows for efficient organization of all the data within the lake, and it also allows us to start processing that data.
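To make the idea concrete, here is a minimal sketch of what registering an entity at ingestion time might look like. Everything here is hypothetical (the `EntityRecord` fields, the in-memory `MetadataCatalog`, and the entity name `customer_feedback` are illustrative, not part of any specific product); a real lake would persist this catalog and derive far richer metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EntityRecord:
    """Metadata captured when data lands in the Raw Zone (illustrative fields)."""
    entity: str       # logical source name, e.g. "customer_feedback"
    source_type: str  # "file", "rdbms", or "stream"
    fmt: str          # "csv", "xml", "avro", "parquet", ...
    path: str         # location within the Raw Zone
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class MetadataCatalog:
    """Minimal in-memory catalog; a production lake would persist this."""
    def __init__(self):
        self._records = []

    def register(self, record: EntityRecord) -> None:
        self._records.append(record)

    def find(self, entity: str) -> list:
        return [r for r in self._records if r.entity == entity]

catalog = MetadataCatalog()
catalog.register(EntityRecord("customer_feedback", "file", "csv",
                              "raw/customer_feedback/batch_001.csv"))
print(len(catalog.find("customer_feedback")))  # prints 1
```

Even this small amount of context (entity, format, location, timestamp) is enough to drive the downstream processing described next.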
In the context of the diagram, we are using a cloud-based EMR cluster to perform various actions, such as data quality checks, watermarking, format conversion, tokenization, and masking. Realistically, a good data lake should be agnostic to the underlying compute and storage infrastructure. These operations populate what we call the Trusted Zone: cleansed, normalized, and regulation-compliant data.
Specifically, in our diagram, we have some malformed or invalid data in the customer feedback source, represented by the red lines in that source. This data needs to be excluded from the Trusted Zone based on predefined data quality rules. It can then be cleansed, tagged for follow-up, or thrown out altogether (since the raw data is maintained in the Raw Zone); the specific process depends on business needs.
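As a rough sketch of how such rules might be applied, the snippet below splits incoming rows into trusted and quarantined sets. The field names (`customer_id`, `rating`) and the rules themselves are hypothetical; at scale this filtering would typically run as a Spark job rather than plain Python.

```python
def apply_dq_rules(rows, rules):
    """Split rows into (trusted, quarantined) using predicate rules."""
    trusted, quarantined = [], []
    for row in rows:
        if all(rule(row) for rule in rules):
            trusted.append(row)
        else:
            quarantined.append(row)
    return trusted, quarantined

# Hypothetical quality rules for the customer feedback source
rules = [
    lambda r: r.get("customer_id") is not None,
    lambda r: isinstance(r.get("rating"), int) and 1 <= r["rating"] <= 5,
]

feedback = [
    {"customer_id": 101, "rating": 4},
    {"customer_id": None, "rating": 5},  # malformed: missing customer id
    {"customer_id": 102, "rating": 9},   # malformed: rating out of range
]
trusted, quarantined = apply_dq_rules(feedback, rules)
print(len(trusted), len(quarantined))  # prints 1 2
```

Because the Raw Zone keeps the original data intact, quarantined rows can always be revisited, cleansed, and re-promoted later.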
We’ve also got some sources that might contain sensitive or PII data, such as credit card numbers (in the sales transaction logs) or social security numbers (in the customer DB). These fields need to be masked or tokenized before being moved into the Trusted Zone, for example, with a SHA-256 hash or a similar method. It’s also possible to do some parsing or normalization of specific data, in this scenario, on the XML web log data, which will make it easier to consume downstream. And some sources may pass directly from the Raw to the Trusted Zone with no processing at all, such as the inventory and clickstream data.
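A minimal sketch of SHA-256 masking is shown below. Note the distinction: hashing like this is one-way and irreversible (good for masking, and stable enough to join on), whereas true tokenization keeps a secure mapping so the original value can be recovered by authorized systems. The salt value and field names here are illustrative.

```python
import hashlib

def mask_field(value: str, salt: str = "per-deployment-secret") -> str:
    """One-way mask via salted SHA-256; irreversible, but deterministic,
    so the same input always yields the same token for downstream joins."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Hypothetical customer record on its way to the Trusted Zone
record = {"name": "Jane Doe", "ssn": "123-45-6789"}
record["ssn"] = mask_field(record["ssn"])  # 64-char hex digest replaces the SSN
```

In practice the salt (or a keyed HMAC) would be managed as a secret, since an unsalted hash of a low-entropy value like an SSN is vulnerable to brute-force reversal.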
After data is in the Trusted Zone, it’s available for consumption from downstream BI tools, or for processing. Since all of this data is related to customers, we can create a unified, 360-degree view of the business by merging all this data into one table, for example, using Spark. This gives us one table that contains aspects such as purchases, web clicks, and feedback ratings, which can then be further analyzed and processed by BI tools or by a data lake-native tool such as the Zaloni Data Platform.
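The merge itself is conceptually just a set of joins on a shared customer key. The sketch below shows that shape in plain Python for readability; all field names are hypothetical, and at real data volumes this would be a Spark job joining DataFrames rather than looping over dictionaries.

```python
def build_customer_360(customers, sales, clicks, feedback):
    """Merge per-customer records from each Trusted Zone source on customer_id.
    Assumes every customer_id in the other sources appears in `customers`."""
    view = {}
    for c in customers:
        view[c["customer_id"]] = {**c, "purchases": [],
                                  "web_clicks": 0, "ratings": []}
    for s in sales:
        view[s["customer_id"]]["purchases"].append(s["amount"])
    for k in clicks:
        view[k["customer_id"]]["web_clicks"] += 1
    for f in feedback:
        view[f["customer_id"]]["ratings"].append(f["rating"])
    return view

# Illustrative Trusted Zone records for a single customer
customers = [{"customer_id": 101, "name": "<masked>", "age": 34}]
sales = [{"customer_id": 101, "amount": 25.0},
         {"customer_id": 101, "amount": 40.0}]
clicks = [{"customer_id": 101, "page": "/home"}]
feedback = [{"customer_id": 101, "rating": 4}]

view = build_customer_360(customers, sales, clicks, feedback)
print(view[101]["purchases"])  # prints [25.0, 40.0]
```

The result is one row per customer combining profile, purchase, clickstream, and feedback data, exactly the shape a BI tool needs for 360-degree analysis.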
Stay tuned for more posts focusing on topics like ingestion of tricky unstructured and semi-structured formats (PDFs, XLS, etc.), regulation and audit compliance (GDPR, RDARR, etc.), and other use cases well-suited to the data lake.
Learn more about Zaloni's data lake solution for Customer 360.