A data lake is a central location in which to store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools—typically tools in the extended Hadoop ecosystem—to extract value quickly and inform key organizational decisions.
Because of the growing variety and volume of data, data lakes are an emerging and powerful architectural approach, especially as enterprises turn to mobile, cloud-based applications, and the Internet of Things (IoT) as right-time delivery mediums for big data.
Data Lake versus EDW
The differences between enterprise data warehouses (EDW) and data lakes are significant. An EDW is fed data from a broad variety of enterprise applications. Naturally, each application's data has its own schema, so the data must be transformed to conform to the EDW's own predefined schema. Because it is designed to collect only data that is controlled for quality and conforms to an enterprise data model, the EDW is capable of answering only a limited number of questions.
Data lakes, on the other hand, are fed information in its native form. Little or no processing is performed for adapting the structure to an enterprise schema. The biggest advantage of data lakes is flexibility. By allowing the data to remain in its native format, a far greater—and timelier—stream of data is available for analysis.
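The contrast between the two loading styles can be sketched in a few lines of Python. This is only an illustration, not how any particular warehouse or Hadoop tool works: the sample records, the `EDW_COLUMNS` schema, and the function names are all hypothetical, and plain in-memory lists stand in for real storage.

```python
import json

# Hypothetical records from two source applications, each with its own shape.
RAW_RECORDS = [
    '{"customer_id": 1, "email": "a@example.com", "plan": "pro"}',
    '{"user": "bob", "text": "love the new release", "likes": 12}',
]

# Hypothetical predefined warehouse schema, enforced when data is written.
EDW_COLUMNS = ("customer_id", "email")


def load_into_edw(raw_lines):
    """Keep only records that fit the predefined schema, projected onto its columns."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        if all(col in record for col in EDW_COLUMNS):
            table.append({col: record[col] for col in EDW_COLUMNS})
    return table


def load_into_lake(raw_lines):
    """Store every record unchanged, in its native form."""
    return list(raw_lines)


def query_lake(lake, field_name):
    """Apply structure at read time: parse each raw record and pull one field if present."""
    values = []
    for line in lake:
        record = json.loads(line)
        if field_name in record:
            values.append(record[field_name])
    return values


edw = load_into_edw(RAW_RECORDS)    # the social media record and the "plan" field are dropped
lake = load_into_lake(RAW_RECORDS)  # both records are kept as-is
likes = query_lake(lake, "likes")   # a question nobody planned for at load time
```

In this sketch the warehouse can no longer answer questions about `plan` or `likes`, because those fields were discarded at load time, while the lake can still answer them at the cost of parsing the raw data on every read.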
Some of the benefits of a data lake include:
- Ability to derive value from unlimited types of data
- Ability to store all types of structured and unstructured data, from CRM data to social media posts
- More flexibility—you don’t have to have all the answers up front
- Ability to store raw data and refine it as your understanding and insight improve
- Unlimited ways to query the data
- Application of a variety of tools to gain insight into what the data means
- Elimination of data silos
- Democratized access to data via a single, unified view of data across the organization when using an effective data management platform
To be classified as a data lake, a big data repository should exhibit three key characteristics:
- A single shared repository of data, typically stored within a distributed file system (DFS). Hadoop data lakes preserve data in its original form and capture changes to data and contextual semantics throughout the data lifecycle. This approach is especially useful for compliance and internal auditing activities. It improves on the traditional EDW, where data that has undergone transformations, aggregations and updates is hard to piece back together, and organizations struggle to determine its provenance.
- Orchestration and job scheduling capabilities (e.g., via YARN). Workload execution is a prerequisite for enterprise Hadoop, and YARN provides resource management and a central platform to deliver consistent operations, security and data governance tools across Hadoop clusters, ensuring that analytic workflows have access to the data and the computing power they require.
- A set of applications or workflows to consume, process or act upon the data. Easy user access is one of the hallmarks of a data lake because organizations preserve the data in its original form. Whether structured, unstructured or semi-structured, data is loaded and stored as-is. Data owners can then consolidate customer, supplier and operations data, eliminating technical (and even political) roadblocks to sharing data.
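To make the three characteristics concrete, here is a toy in-memory model in Python. It is a sketch under stated assumptions, not a Hadoop or YARN API: `ToyDataLake`, `LakeEntry`, and every method name are hypothetical, a dictionary stands in for the distributed file system, and a simple list stands in for a real scheduler.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LakeEntry:
    source: str      # where the data came from
    payload: bytes   # the data exactly as received
    checksum: str    # fingerprint of the raw bytes, useful for audits
    history: list = field(default_factory=list)  # lifecycle events, oldest first


class ToyDataLake:
    """Hypothetical stand-in for a data lake; not a real Hadoop interface."""

    def __init__(self):
        self.entries = {}  # 1. a single shared repository of raw data
        self.jobs = []     # 2. a queue of scheduled jobs

    def ingest(self, key, source, payload):
        """Store the payload as-is and start its provenance trail."""
        entry = LakeEntry(source, payload, hashlib.sha256(payload).hexdigest())
        entry.history.append("ingested from " + source)
        self.entries[key] = entry

    def schedule(self, name, func):
        """Register a workflow to run later against the whole repository."""
        self.jobs.append((name, func))

    def run_jobs(self):
        """Run the queued workflows in order and record each read in the history."""
        results = {}
        for name, func in self.jobs:
            results[name] = func(self.entries)
            for entry in self.entries.values():
                entry.history.append("read by job " + name)
        self.jobs.clear()
        return results


lake = ToyDataLake()
lake.ingest("crm/1", "crm", b'{"customer": "Acme"}')
lake.ingest("social/1", "social", b"Acme ships fast!")

# 3. an application that consumes the data: total bytes stored across sources
lake.schedule("count_bytes", lambda entries: sum(len(e.payload) for e in entries.values()))
results = lake.run_jobs()
```

Because the raw payload and its history are kept together, the provenance question that is hard to answer in a traditional EDW becomes a simple lookup on `lake.entries["crm/1"].history` in this model.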
Data lakes are becoming more and more central to enterprise data strategies. Data lakes best address today’s data realities: much greater data volumes and varieties, higher expectations from users, and the rapid globalization of economies. At Zaloni, we help clients implement data lakes with the management and governance capabilities in place to ensure scalable, actionable data lakes for now and into the future.
To learn more about building a governed, actionable and scalable data lake, get our eBook, Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases.
About the Author
Kelly Schupp is Vice President of Marketing for Zaloni. Kelly has 20 years of experience in the enterprise software and technology industry. She has held a variety of global marketing leadership roles, and previously worked at IBM, Micromuse and Porter Novelli. Kelly serves as Zaloni's brand steward and is deeply passionate about the impact of data-driven marketing.