Save Money in the Cloud with Transient Clusters for Big Data

February 25, 2016 Scott Gidley

Cheaper, faster. Faster, cheaper. When it comes to getting value from big data, paying less and reducing time-to-insight are always top-of-mind goals. To achieve them, many enterprises are turning to the cloud to augment or replace their on-premises Hadoop infrastructure.

Pay only for what you need

One key reason for the shift is that Hadoop in the cloud decouples storage from compute, so enterprises can pay for storage at a lower rate than for compute. The cloud also offers elastic scalability that on-premises architecture can't match. With cloud services like AWS EMR or Microsoft Azure HDInsight, enterprises can spin up and scale Hadoop clusters on demand. Have a job that isn't processing fast enough? Add more nodes, then scale back down when it's done. Have several jobs of various sizes? Run multiple clusters of exactly the size needed so that no resources are wasted. Add transient clusters to the mix, and the cloud becomes an extremely customizable big data solution.
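To make that concrete, resizing a running EMR cluster is a single API call. The sketch below uses the AWS SDK for Python (boto3); the cluster ID and instance-group ID are hypothetical placeholders, and the same operation is available from the console or CLI.

```python
import boto3

# Minimal sketch of on-demand resizing with boto3.
# The cluster and instance-group IDs are hypothetical placeholders.
emr = boto3.client("emr", region_name="us-east-1")

# Scale the task/core instance group up while a heavy job is running...
emr.modify_instance_groups(
    ClusterId="j-EXAMPLECLUSTERID",
    InstanceGroups=[{"InstanceGroupId": "ig-EXAMPLEGROUPID", "InstanceCount": 10}],
)

# ...and back down once it finishes, so you stop paying for idle nodes.
emr.modify_instance_groups(
    ClusterId="j-EXAMPLECLUSTERID",
    InstanceGroups=[{"InstanceGroupId": "ig-EXAMPLEGROUPID", "InstanceCount": 2}],
)
```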

Leverage transient clusters

Transient clusters are compute clusters that shut down automatically, and stop billing, as soon as processing is finished. In the past, though, this cost-effective approach had a catch: when a transient cluster is terminated, the cloud provider deletes its metadata along with it. As a result, most enterprises have opted to pay for persistent compute across the board just to keep that metadata.
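For illustration, here is a minimal sketch of launching a transient cluster on EMR with boto3. The release label, instance types, jar path, and IAM roles are hypothetical placeholders; the key setting is KeepJobFlowAliveWhenNoSteps, which tells EMR to terminate the cluster, and stop billing, once all of its steps complete.

```python
import boto3

# Sketch of a transient EMR cluster: it runs one step, then terminates itself.
# Bucket names, jar paths, and roles below are hypothetical placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-aggregation-transient",
    ReleaseLabel="emr-4.3.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 4,
        # Shut the cluster down (and stop billing) once all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "run-aggregation",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://my-bucket/jars/aggregation-job.jar",
            "Args": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched transient cluster:", response["JobFlowId"])
```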

Now, with a data management platform like Bedrock, enterprises can get the cost savings of transient clusters and still keep their metadata. How does it work? Bedrock monitors data as it is ingested into the transient cluster in the cloud and stores the resulting metadata outside EMR/HDInsight, so the metadata is still available after the cluster is terminated.
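As an illustration of the general pattern only (not Bedrock's actual implementation), the sketch below records ingestion metadata in S3, a store that outlives any single cluster. The bucket name, key layout, and fields are hypothetical.

```python
import json
from datetime import datetime, timezone

import boto3

# Illustrative pattern: write operational metadata for each ingestion run to
# durable storage outside the cluster, here a plain S3 object. Bucket and key
# names are hypothetical.
s3 = boto3.client("s3")

def record_ingestion_metadata(dataset, source_path, record_count, schema):
    """Persist metadata for one ingestion run so it survives cluster teardown."""
    metadata = {
        "dataset": dataset,
        "source_path": source_path,
        "record_count": record_count,
        "schema": schema,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    key = "metadata/{}/{}.json".format(dataset, metadata["ingested_at"])
    s3.put_object(
        Bucket="my-metadata-bucket",
        Key=key,
        Body=json.dumps(metadata).encode("utf-8"),
    )
    return key
```

Because the metadata lives outside the cluster, a catalog built on top of it can still answer questions about what was loaded, when, and with what schema long after the compute has been torn down.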

Why is this important? Metadata is the key to getting value from big data. It’s the technical, operational and business information about the data that allows users to find the data they need in the data lake, verify its quality and trust the validity of their analyses and business intelligence.  

A hybrid approach

Moving storage and applications to the cloud isn't an all-or-nothing proposition. In reality, most enterprises take a hybrid approach to the data lake: some storage and processing, including transient clusters, runs in the cloud (perhaps for less sensitive, third-party data), and the rest stays on-premises. An intelligent Hadoop data lake management platform like Bedrock is flexible and provides a centralized way to manage on-premises and cloud-based computing across the enterprise. Is it time for your enterprise to leverage the cloud? We can help you weigh the pros and cons; please contact us.

About the Author

Scott Gidley

Scott Gidley is Vice President of Product Management for Zaloni, where he is responsible for the strategy and roadmap of existing and future products within the Zaloni portfolio. Scott is a nearly 20-year veteran of the data management software and services market. Prior to joining Zaloni, Scott served as senior director of product management at SAS and was previously CTO and co-founder of DataFlux Corporation.
