Migrating On-Premises Data Lakes to Cloud

October 10, 2016 Ben Sharma

Migration Objectives

In the first blog of this series, we discussed some of the key drivers for a Cloud Data Lake such as:

  • the cost advantages of the elastic utility model of cloud, especially for highly variable workloads typical of Data Lake operations and Analytics processing
  • lower administrative and operational cost achieved by delegating the heavy lifting of configuration and platform maintenance to the cloud providers
  • access to a range of compute and storage options beyond Hadoop, as well as advanced cloud services
  • geographical coverage and data availability guarantees

But how do enterprises that already have an on-premises Data Lake migrate to the Cloud to realize those benefits? Every cloud migration project has to begin with a clear statement of business as well as technical objectives. Cost reduction without loss of service levels, together with the same or a superior user experience, tends to be the top business objective. While the current cost of an on-premises data platform may be known, quantifying the future costs of a Data Lake in the Cloud can be done only in the context of architectural decisions made after sorting through and picking from a bewildering array of options across cloud providers.

Business objectives further clarify the scope and time frame for the migration, compliance requirements with respect to data security, physical location and longevity of data, and business continuity needs during and after migration. Scope has to do with which on-premises data sets, from which enterprise functions or departments, from which on-premises Hadoop clusters and from which data centers, will migrate to the Cloud Data Lake. Compliance requirements are particularly important in highly regulated industries like healthcare and banking; they go hand in hand with security needs founded on a well-researched threat model. Business continuity needs determine which critical applications cannot tolerate any downtime during migration.

Technical objectives lay down the use cases to which a variety of users will subject the data. These users may belong to different business functions and come with varying skill sets and data access and processing needs. The enterprise's on-premises data governance practice may have been spread across the infrastructure, application and security teams, but that governance model will change in the cloud based on the choice of Identity and Access Management services and tools. Similarly, the data security objectives will determine how data will be secured in the cloud while in motion and at rest. In addition, technology choices in the cloud may be guided by requirements around feature parity and data processing performance in comparison with the on-premises Data Lake.

A Data Lake management application layer greatly facilitates the realization of the business and technical objectives. It does this by abstracting the underlying data platform technologies, whether on-premises or in the cloud, away from the user, and by providing a common metadata view.
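To make the abstraction idea concrete, here is a minimal sketch of a metadata layer that resolves a logical dataset name to its current physical location, so applications are unaffected when a dataset moves from on-premises HDFS to cloud object storage. All class, dataset, and path names below are hypothetical illustrations, not any particular product's API:

```python
# Minimal sketch of a metadata catalog that gives applications a
# location-independent view of Data Lake datasets. All names here are
# hypothetical placeholders, for illustration only.

class MetadataCatalog:
    def __init__(self):
        # logical dataset name -> current physical URI (on-premises or cloud)
        self._locations = {}

    def register(self, dataset, uri):
        """Record (or update) the physical location of a logical dataset."""
        self._locations[dataset] = uri

    def resolve(self, dataset):
        """Return the current physical URI for a logical dataset name."""
        return self._locations[dataset]

catalog = MetadataCatalog()
# Before migration: the dataset lives in the on-premises HDFS cluster.
catalog.register("sales.transactions",
                 "hdfs://onprem-nn:8020/data/sales/transactions")
# After migration: same logical name, new cloud-native location.
catalog.register("sales.transactions",
                 "s3://corp-data-lake/sales/transactions")

# Applications always go through the catalog, so the move is transparent.
print(catalog.resolve("sales.transactions"))
```

Because consumers only ever see the logical name, the catalog can be updated in one place during each migration phase without touching application code.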

Cloud Technology Choices and Migration Design

While there are a multitude of technology choices and cloud providers, we see three broad models of Data Lake migration to the cloud:

  • Forklift migration of on-premises Hadoop cluster to cloud
  • Migration to use Hadoop based cloud services and cloud native storage
  • Migration to a hybrid on-premises/cloud model, using a variety of cloud native services, and establishing a seamless data fabric view with metadata

These are also reflective of the increasing levels of maturity of Cloud Data Lake adoption. There are of course variations of these models making more or less use of cloud elasticity with the help of a management framework.

Forklift migration refers to moving an on-premises Hadoop cluster to one built from the ground up on basic compute instances in the Cloud. This is the simplest migration model and leverages existing staff skill sets. It uses only the IaaS aspect of the cloud, with persistent compute instances, typically backed by instance-local storage. Except for infrastructure access, security is entirely the cloud customer's responsibility, as is the creation, configuration, monitoring and maintenance of the cluster.

Moving from Hadoop on-premises to using Hadoop as a service from the Cloud provider is the second model of migration. Much of the heavy lifting around Hadoop cluster setup and configuration, and ensuring compatibility of Hadoop ecosystem components is left to the cloud provider. A Data Lake management application may aid in the creation and use of transient Hadoop clusters on demand and interface directly to cloud native persistent storage.
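As one concrete example of this model, AWS EMR lets you request a transient cluster by setting `KeepJobFlowAliveWhenNoSteps` to `False` in a `run_job_flow` request: the cluster runs its steps, persists results to S3, and terminates. The sketch below only builds the request dictionary; the bucket, script, and log paths are made-up placeholders, and the actual `boto3` call is shown but not executed:

```python
# Sketch of a transient EMR cluster request: the cluster runs its steps,
# writes results to cloud-native storage (S3), then terminates. Bucket,
# script, and role names are hypothetical placeholders.

def transient_cluster_request(script_uri, log_uri):
    """Build run_job_flow parameters for an on-demand, transient cluster."""
    return {
        "Name": "nightly-etl",
        "ReleaseLabel": "emr-5.0.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # False => the cluster shuts down once its steps complete.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "LogUri": log_uri,
    }

request = transient_cluster_request(
    "s3://corp-data-lake/jobs/etl.py", "s3://corp-data-lake/emr-logs/")
# With credentials configured, this would launch the cluster:
# import boto3
# boto3.client("emr").run_job_flow(**request)
```

Because compute is transient and data lives in object storage, you pay for the cluster only while the job runs, which is where the elasticity savings of this model come from.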

The third model of Data Lake migration involves a gradual transition from Hadoop on-premises to a hybrid on-premises/cloud architecture, using a variety of cloud native storage options and services in addition to the Hadoop ecosystem tools, and adopting cloud service patterns for processing event streams, real-time analytics, and machine learning. This model presupposes a metadata management layer to remove any mismatch between the underlying technologies and provide a seamless data fabric view across all the data, regardless of storage location.

Between the three aforementioned migration models, the major Hadoop distributions (Cloudera, Hortonworks, MapR), the ever-expanding Hadoop ecosystem tool variations they support, and the big three cloud service providers (AWS, Azure, GCP), each with unique service offerings and pricing, the options for migration are too numerous to list here. Meaningful comparisons will need to be done in the context of specific business and technical requirements.


A good migration design requires deep expertise in Data Lake and cloud technologies and in data pipeline design patterns, whether developed internally or procured from a service provider.

Migration Planning and Execution

Data Lake migration planning typically starts with a proof of concept pilot to validate the technical choices, feature parity, and performance in the cloud. This is followed by a phased approach consistent with the chosen migration model that takes into account:

  • Infrastructure migration decisions - storage and compute, sizing, scaling, networking
  • Security of data and governance of data access, and resource usage in the cloud
  • Retooling data ingestion so that data currently received by the on-premises platform from different sources is sent to the cloud Data Lake
  • Detailed inventory of on-premises Data Lake, and mapping to cloud platform
  • Data transformation pipelines and corresponding translation to cloud mechanisms
  • Application migration - forklift vs rewrite, processes for development, test, and production
  • Data extraction tools and processes in the cloud for visualization, insights, or predictions
  • Migration options for historical data
  • Versions of cloud tools and application compatibility
  • Data Lake management applications
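For the inventory and historical-data items above, even a rough script helps size the effort: map each on-premises HDFS path to a target cloud storage URI and estimate the bulk-transfer time at the available bandwidth. A minimal sketch follows; the paths, sizes, bucket name, and bandwidth figure are all made-up illustrations:

```python
# Rough sketch: map an on-premises HDFS inventory to target cloud storage
# and estimate bulk-transfer time for historical data. All paths, sizes,
# and the bandwidth figure are illustrative placeholders.

# Inventory of on-premises datasets: HDFS path -> size in gigabytes.
inventory = {
    "/data/sales/transactions": 12000,
    "/data/clickstream/raw": 45000,
    "/data/warehouse/dim_tables": 300,
}

def map_to_cloud(hdfs_path, bucket="corp-data-lake"):
    """Translate an HDFS path into a target object-store URI."""
    return f"s3://{bucket}{hdfs_path}"

def transfer_days(total_gb, gbps=1.0):
    """Days to move total_gb over a sustained link of gbps gigabits/s."""
    seconds = (total_gb * 8) / gbps      # gigabytes -> gigabits
    return seconds / 86400               # seconds per day

migration_plan = {path: map_to_cloud(path) for path in inventory}
total_gb = sum(inventory.values())

print(f"{len(migration_plan)} datasets, {total_gb} GB total")
print(f"~{transfer_days(total_gb):.1f} days at 1 Gbps sustained")
```

Even this back-of-the-envelope arithmetic often settles the historical-data question, for example whether network transfer is feasible or a physical transfer appliance is needed.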

An execution plan that defines the transition process from on-premises to the Cloud Data Lake, along with testing, performance monitoring, and business continuity during and after the cutover, is critical to a successful migration.

Conclusion

The benefits of migrating a Data Lake from on-premises to the cloud are achieved only through a careful specification of business and technical objectives, a validated set of migration design choices, planning and phased execution. A metadata management application layer is invaluable during the transition as well as for future proofing the Data Lake solution in the cloud.

About the Author

Ben Sharma

Ben Sharma is CEO and co-founder of Zaloni. He is a passionate technologist with experience in business development, solutions architecture, and service delivery of big data, analytics and enterprise infrastructure solutions. Having previously worked in management positions for NetApp, Fujitsu and others, Ben's expertise ranges from business development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization and storage. Ben is the co-author of Java in Telecommunications and holds two patents. He received his MS in Computer Science from the University of Texas at Dallas.
