As enterprises modernize their aging data platforms, data lake architecture has become crucial to the future of their businesses. The term data lake has evolved in enterprise data architecture to mean a scalable data storage and compute platform that can flexibly hold data of all types, and process and query that data in various ways. In practice, scalability has been achieved using a Hadoop-based distributed platform, and flexibility using a block storage abstraction that is agnostic to whether data is structured, semi-structured, or unstructured.
What is missing from the above definition is the temporal flow of data from disparate sources into the data lake, and the dynamic surround that makes the data lake itself useful: the means to know what is in the data lake (metadata and catalog), the policies determining the who, when, and how of data usage (governance and security), and the tools to derive insights (analytics of all kinds) that should ultimately translate to better business outcomes.
What then is “Data Lakes in the Cloud”? As was done with other enterprise IT platforms, the earliest attempts were to mimic on-premises Hadoop infrastructure in the cloud. This may have been part of the journey toward a more mature cloud adoption model, since organizations could leverage their existing skill sets. But as an antithesis to the utility model of cloud usage, such attempts to create and maintain a permanent enterprise infrastructure in the cloud hardly make any economic sense. And a “lake in the cloud,” applied to data or otherwise, is not the most meaningful metaphor.
We define “Data Lakes in the Cloud” as a data fabric that pervades the enterprise and multiple cloud provider realms, overlaid by a management plane. Such a fabric affords a seamless view of data and motivates optimal use of multiple storage and compute options across private and public clouds. It leverages IaaS, as well as PaaS and higher-level cloud services, unrestricted by physical infrastructure boundaries or specific distributed system technologies like Hadoop. We view the combining of data lake and cloud technologies less as a means for physically assembling data in a single repository than as a means for assembling metadata in order to efficiently derive value from data regardless of its location. Through the current series of blogs, we strive to present the various facets of building, populating, maintaining, and benefiting from a cloud data lake that extends beyond the internal datacenters in an enterprise.
Benefits of Building a Data Lake in the Cloud
The economics of the cloud model have proven beneficial for enterprises that need rapid provisioning and elastic scaling of IT resources, not to mention the attractiveness of “pay only for what you use” billing and the avoidance of capital expenses. Consequently, we have seen an ongoing extension of the enterprise data center to the cloud and a relentless migration of applications. Lest an enterprise data lake platform be left behind this trend, it is only natural to explore how it fits within and benefits from the cloud model.
A key driver then for building a data lake in the cloud is obviously the well-understood advantage of the cloud model. The dynamics of working with data in a data lake often calls for rapid provisioning of compute resources, scaling up or down in response to data flows and processing needs. These requirements are well served by the cloud.
With a choice of abstraction levels in terms of infrastructure, platform, and application, there are numerous options for building a cloud data lake. But the greatest benefit is realized at the application level where the heavy lifting of managing the infrastructure and platform is delegated to the cloud providers.
A data management layer as indicated in our definition of a cloud data lake is central to the efficient use of a data platform built around such cloud infrastructure and platform offerings.
Cloud-Based Options for Data Lakes
Advances in distributed computing, together with the availability of low-cost commodity servers and a groundswell of open source interest, have propelled the Hadoop platform as the de facto distributed operating system for holding and processing big data in the enterprise. Distributing such data across a cluster of servers and dispatching processing code to where the data is has been a game changer in processing big data at speeds not possible before.
Given Hadoop and related ecosystem frameworks as the basis of a data lake, what are the options for data lakes in the cloud? As we mentioned before, one option is to simply recreate the on-premises Hadoop cluster in the cloud as a first step in the journey toward the cloud model. But as the journey progresses, it is instructive to explore the options in the categories of storage, compute, and other cloud native services which would be used as such or integrated into a data lake.
Storage - The possibilities for storage range from block to object to file level abstractions, with different degrees of redundancy, availability and consistency guarantees, and cost considerations.
Compute - A variety of compute server types are possible, optimized for different types of memory and processing requirements depending on the workload.
Cloud services - These provide higher levels of resource and service abstractions such as cloud provider managed Hadoop clusters, managed databases, warehouses, messaging services, etc.
A more mature Cloud Data Lake would use cloud services beyond just Hadoop clusters. It would use a variety of streaming services, NoSQL and in-memory databases, and queuing and workflow decoupling services for an end-to-end data lake. The orchestration of such data pipelines and metadata management would be handled by a data lake management layer.
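The decoupling idea behind such pipelines can be sketched in a few lines. This is an illustrative example only, using Python's standard-library queue as a stand-in for a managed cloud queuing service; the stage and field names are our own, not any provider's API. The point is that the ingestion and processing stages share only the queue, so each can scale or fail independently:

```python
import json
import queue
import threading

# Stand-in for a managed cloud queue; in a real pipeline this would be
# a provider queuing service rather than an in-process object.
event_queue = queue.Queue()
SENTINEL = None  # signals end of stream

def ingest(records):
    """Ingestion stage: publish raw records onto the queue."""
    for record in records:
        event_queue.put(json.dumps(record))
    event_queue.put(SENTINEL)

def process(results):
    """Processing stage: consumes at its own pace, unaware of the producer."""
    while True:
        message = event_queue.get()
        if message is SENTINEL:
            break
        record = json.loads(message)
        results.append({**record, "processed": True})

results = []
consumer = threading.Thread(target=process, args=(results,))
consumer.start()
ingest([{"id": 1}, {"id": 2}])
consumer.join()
```

Because neither stage calls the other directly, either side can be replaced (or scaled out) without touching its counterpart, which is exactly what the queuing and workflow decoupling services above provide at cloud scale.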
Assessing Cloud Providers
As cloud-based service providers proliferate, enterprises today have a multitude of options for beginning their cloud data lake journey. The options include IaaS offerings with bare metal or virtualized infrastructure on which to build the data lake. These can be complemented with PaaS layers that provide managed data platforms, including various options for event-based data ingestion, data processing, and serving layers. Several cloud providers are also beginning to offer analytics as a service, with machine learning offerings built on top of their IaaS and PaaS layers.
Selection of cloud providers should be based on the requirements of the organization. Some enterprises require wide geographical coverage due to in-country data residency requirements; in such cases, the footprint and coverage of the cloud provider is an important consideration, and multiple providers may be needed to achieve it. Beyond geography, the organization's requirements should be mapped to the infrastructure, platform, security, and network offerings of each provider, and further to factors such as workload type (batch, streaming, relational), provider-managed vs. self-managed services, cost, and administration.
An important topic that frequently comes up in these discussions is cloud provider lock-in. Increasingly, enterprises are focusing on portability across cloud providers. This is especially important in multi-cloud environments where the logical data lake is made up of multiple physical environments from different cloud providers, due to geographical coverage or other considerations. For example, there could be a data platform based on Cloud Provider A in the US and another based on Cloud Provider B in Japan. The enterprise needs a logical view of the data lake with a unified catalog and proper data governance of all datasets from both environments. In such cases application portability can be provided by containerizing the applications using one of the container frameworks. However, data portability also needs to be considered, so that applications can access the data in a cloud-provider-agnostic way. One common approach is to put together an API abstraction layer for keyed data access. For bulk data access and processing, Hadoop and its ecosystem frameworks, rather than a cloud-native data platform, may be used to provide that portability.
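One way such an API abstraction layer for keyed data access could be sketched is with a provider-agnostic interface and swappable backends. The interface, class, and function names below are hypothetical; real implementations would wrap each cloud provider's SDK behind the same interface:

```python
from abc import ABC, abstractmethod

class KeyedDataStore(ABC):
    """Provider-agnostic interface for keyed data access."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(KeyedDataStore):
    """Test double; a real backend would delegate these calls to a
    provider-specific object store client (Provider A, Provider B, ...)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

def archive_dataset(store: KeyedDataStore, name: str, payload: bytes) -> None:
    # Application code depends only on the interface, so swapping the
    # provider-specific backend does not change this function.
    store.put(f"datasets/{name}", payload)

store = InMemoryStore()
archive_dataset(store, "sales-2017", b"...csv bytes...")
```

Since `archive_dataset` never touches a provider SDK directly, moving the dataset between Provider A and Provider B reduces to instantiating a different `KeyedDataStore` implementation.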
Patterns and Anti-Patterns for Cloud Use
All of the extolled benefits of a cloud-based platform require that cloud paradigms are understood, appropriate architectural and implementation patterns followed, and anti-patterns avoided. Here, we mention a few of the best practices applicable to realizing a cloud data lake.
Patterns:
- Implement the data lake in the cloud using elastic compute and cloud-optimized storage
- Use a data lake provided as a cloud service, managed and optimized by the cloud provider
- Decouple data pipeline processing components with queuing services
- Leave the heavy lifting to cloud provider services, for example for elastic clusters, streaming, analytics, and machine learning
- Use durable cloud storage with data lifecycle management rather than ephemeral storage
- Process streaming data in real time with event-driven architectures

Anti-patterns:
- Forklift migration of an on-premises data lake to the cloud
- Unmanaged, unmonitored, long-term usage of resources such as persistent on-demand compute instances
- Dedicating cloud resources for service peaks rather than using cloud scaling services
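The event-driven pattern in the list above can be sketched as a small handler registry. This is an illustrative toy, not a cloud provider API: in practice the provider's event service (not a local `dispatch` function) would invoke the registered handler when, say, a new object lands in cloud storage:

```python
# Registry mapping event types to handler functions.
handlers = {}

def on_event(event_type):
    """Decorator registering a handler for an event type."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

processed = []

@on_event("object_created")
def index_new_file(event):
    # Example reaction: update the data lake catalog when a new object lands.
    processed.append(event["key"])

def dispatch(event):
    """Stand-in for the cloud event service delivering an event."""
    for fn in handlers.get(event["type"], []):
        fn(event)

dispatch({"type": "object_created", "key": "raw/clicks/2017-06-01.json"})
```

The design choice worth noting is that handlers run only when events arrive, so no compute sits idle waiting for data, which is precisely the contrast with the dedicated-resources anti-pattern above.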
A cloud data lake combines the principles of the cloud model and data lake technologies to reap the benefits of both. We envision a data lake future that pervades multiple clouds and technologies, with a seamless data fabric created by a data management layer.
About the Author
Ben Sharma is CEO and co-founder of Zaloni. He is a passionate technologist with experience in business development, solutions architecture, and service delivery of big data, analytics, and enterprise infrastructure solutions. Having previously worked in management positions for NetApp, Fujitsu, and others, Ben's expertise ranges from business development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization, and storage. Ben is the co-author of Java in Telecommunications and holds two patents. He received his MS in Computer Science from the University of Texas at Dallas.