On-Premise or in the Cloud, It’s All the Same for a Hybrid Data Lake

February 9, 2016 Ben Sharma

When it comes to big data, more and more enterprises are embracing the cloud for its flexibility. Many of these companies are adopting a hybrid approach to their big data lakes, looking for ways to leverage the efficiencies and opportunities of cloud-based applications and storage alongside their on-premise data.

A Hadoop data lake management platform such as Bedrock can span on-premise and cloud-based computing across the enterprise. By capturing the metadata needed to implement consistent data management and governance processes, companies can confidently use the cloud as a Hadoop data lake for core use cases that couldn’t be considered before.

One of the cloud’s key advantages is its rapid elasticity: the ability to provision, and pay for, just the resources required at any given moment. Another is that storage and compute services can be decoupled in the cloud. Both of these capabilities significantly lower the barrier to entry for companies of all sizes to derive value from big data.

Let’s take a look at how these capabilities, combined with a data management platform designed for Hadoop, make the cloud a good option for your data lake.

1. Increased compute elasticity

With cloud services like Amazon Web Services (AWS) EMR or Microsoft Azure HDInsight, companies can spin up and scale Hadoop clusters as business demand requires. However, maintaining persistent clusters can be expensive, particularly for proof-of-concept projects and sandbox environments that may not produce a return on investment. Decoupled storage and compute make transient clusters, which automatically shut down and stop billing when processing is finished, a more cost-effective option. This allows administrators to run complex, repeatable workflows on the most comprehensive data sets in the most economical manner.
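To make the pattern concrete, here is a minimal sketch of launching a transient EMR cluster with the AWS SDK for Python (boto3). The bucket, script path, and instance choices are hypothetical; the key setting is KeepJobFlowAliveWhenNoSteps=False, which tells EMR to terminate the cluster (and stop compute billing) as soon as its steps finish.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl-transient",
    ReleaseLabel="emr-4.3.0",  # an EMR release current in early 2016
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 4},
        ],
        # The transient-cluster switch: shut down when the last step completes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "nightly-aggregation",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hive-script", "--run-hive-script",
                    "--args", "-f",
                    "s3://my-bucket/scripts/aggregate.hql",  # hypothetical script
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started transient cluster:", response["JobFlowId"])
```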

2. More cost-effective data storage

Today, processing requirements can be variable, and customers no longer need to duplicate data just to access compute. By using a data management platform that maintains metadata, customers can scale up processing without having to scale up or duplicate storage. And because storage is billed separately from compute, customers pay a lower rate for storage regardless of their computing needs. Cloud service providers like AWS even offer a range of storage options at different price points, depending on accessibility requirements.
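The tiering itself is a one-line operation. Below is a minimal sketch (with a hypothetical bucket and key) that moves an object to S3’s cheaper Standard-Infrequent Access class once its hot processing window has passed; compute jobs continue to address it by the same s3:// URI, so nothing downstream changes.

```python
import boto3

s3 = boto3.client("s3")

# Copying an object onto itself with a new StorageClass rewrites its tier
# in place; the data stays at the same URI at a lower per-GB price.
s3.copy_object(
    Bucket="my-data-lake",  # hypothetical bucket
    Key="raw/2016/01/events.avro",
    CopySource={"Bucket": "my-data-lake", "Key": "raw/2016/01/events.avro"},
    StorageClass="STANDARD_IA",   # Standard-Infrequent Access tier
    MetadataDirective="COPY",     # preserve the object's user metadata
)
```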

3. Metadata maintenance across multiple clusters, including transient clusters

When a transient cluster is shut down, its cluster-local metadata (such as the Hive metastore) is deleted along with it. To gain the greatest value from transient clusters, use a data management platform to monitor ingestion of the data being loaded to the cluster and store the resulting metadata outside EMR/HDInsight, so that it remains available after the cluster is terminated.
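Bedrock captures this metadata for you; as a generic illustration of the underlying principle, one common EMR technique is to point each transient cluster’s Hive metastore at an external database that outlives the cluster. A sketch, with a hypothetical RDS MySQL endpoint and credentials:

```python
import boto3

emr = boto3.client("emr")

# Override hive-site.xml so the metastore lives in an external MySQL
# database (e.g., on Amazon RDS) rather than on the cluster itself.
external_metastore = [
    {
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL":
                "jdbc:mysql://metastore.example.com:3306/hive?createDatabaseIfNotExist=true",
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "hive",
            "javax.jdo.option.ConnectionPassword": "hive-password",  # hypothetical
        },
    }
]

emr.run_job_flow(
    Name="transient-with-shared-metastore",
    ReleaseLabel="emr-4.3.0",
    Applications=[{"Name": "Hive"}],
    Configurations=external_metastore,
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # still transient
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
# Tables defined by this cluster remain visible to the next one,
# because their definitions live in the external database.
```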

4. More consumable data for business users

Metadata is what allows business users to confidently access and use data. With a data management platform, data residing in S3 or Azure Storage is automatically catalogued, and users can easily provision serving-layer data stores like Amazon Redshift for rapid data discovery and consumption. And as the number of business users increases, metadata enables companies to execute enterprise-wide data governance strategies for the management and use of data.
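For instance, once the catalog tells an analyst where a curated data set lives in S3, promoting it to a Redshift serving table is a single COPY statement, which Redshift executes by pulling the files directly from S3. A sketch using psycopg2, with a hypothetical cluster endpoint, table, IAM role, and bucket:

```python
import psycopg2

# Connect to the Redshift cluster (endpoint and credentials are hypothetical).
conn = psycopg2.connect(
    host="analytics.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="analyst", password="secret",
)

# Load curated Avro files from the data lake into a serving table.
copy_sql = """
    COPY daily_sales
    FROM 's3://my-data-lake/curated/daily_sales/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS AVRO 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # commits on success, so the table is ready to query
```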

Make sure you manage your metadata

It’s important to note that although capturing unstructured and semi-structured data in AWS or Microsoft Azure is relatively straightforward, these cloud service providers do not offer an easy way to capture the accompanying metadata. Metadata is essential for managing, migrating, accessing and deploying big data, and for leveraging many of the coveted benefits of the data lake architecture. A robust data management platform built for Hadoop is therefore an essential component of the data lake, enabling companies to implement consistent data management and governance processes across the environment.

A few weeks ago, I presented a webinar on this topic with Ben Lorica of O’Reilly Media. You can access the recording here.

For more information about data lake management and governance, and how to get the most from your data lake in the cloud, contact us.

About the Author

Ben Sharma

Ben Sharma is CEO and co-founder of Zaloni. He is a passionate technologist with experience in business development, solutions architecture, and service delivery of big data, analytics, and enterprise infrastructure solutions. Having previously held management positions at NetApp, Fujitsu, and others, Ben’s expertise ranges from business development to production deployment across a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. Ben is the co-author of Java in Telecommunications and holds two patents. He received his MS in Computer Science from the University of Texas at Dallas.
