Data Lake Archiving: Hadoop or the Cloud?

December 8, 2016 Ben Sharma

The storage layer of the data lake is evolving. A few years ago, when we talked about the data lake, it was generally understood that Hadoop was the underlying platform for everything related to the data lake. Today, thanks to the cloud, that is no longer necessarily the case. Why? Deploying a data lake in the cloud enables you to decouple storage and compute functions and use the storage platform that is best suited and most cost-effective for your needs – which may not be Hadoop.

Recently, I participated in a webinar with O’Reilly Media. During the webinar we received some excellent questions. One of the questions led to a specific discussion about big data archiving solutions within a data lake architecture.

Archival storage is a key area to target for lowering costs, but it’s important to do it in an automated and process-driven way that allows for transparency and scalability – not to mention access, as more and more enterprises become interested in using historical data for data analytics. What are your options today to integrate a more cost-effective archival solution within the data lake architecture? How do you structure your data lifecycle management policies for the data lake to accommodate archiving?

Integrating a non-HDFS archival solution

You don't have to ingest data into HDFS to make it available to a Hadoop application, as long as the storage layer exposes a Hadoop-compatible file system interface. Therefore, to help save on costs, we typically advocate moving archival storage out of Hadoop and into the cloud. There, you can take advantage of a non-block-based file system, such as an object store like Amazon S3, or, when speedy retrieval isn't a requirement, lower-cost long-term storage for cold data like Amazon Glacier.
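
To make that concrete, here is a minimal sketch in Python with PySpark of how a Hadoop application can query data that has been archived to S3 rather than HDFS. It assumes a Spark installation with the hadoop-aws connector on the classpath; the bucket name, paths, and credential values are placeholders for illustration, not real resources.

# A minimal sketch: querying "archived" data that lives in S3 rather than HDFS.
# Assumes the hadoop-aws connector is available; bucket, paths, and keys below
# are placeholders, not real resources.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-archived-data")
    .getOrCreate()
)

# Credentials can also come from instance profiles or environment variables;
# setting them on the Hadoop configuration is just one option.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder

# The same DataFrame API works whether the path is hdfs:// or s3a://,
# which is what keeps archived data queryable after it leaves HDFS.
hot = spark.read.parquet("hdfs:///data/events/current/")
cold = spark.read.parquet("s3a://example-archive-bucket/events/2015/")

hot.unionByName(cold).groupBy("event_type").count().show()

Because the archive is addressed through the same file system API, queries that span hot and cold tiers need no special handling beyond pointing at the right paths.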

Block vs. object storage

Block storage has long been the norm, as traditional file systems break files down into blocks before storing them. Individual blocks carry no metadata of their own; only when they are read back and combined with the other blocks do they form a file that means something to an application or a user.

Object storage is a bit different. Each file is bundled with its metadata as a single “object,” named with an object ID and stored in a flat structure. You retrieve the whole object via its object ID, which enables fast access at scale. Notably for data analytics, the metadata associated with an object is far richer and more flexible than what a block-based file system provides, and it can be customized by users. Object storage therefore offers a simple way to manage archival storage across locations while carrying an array of rich metadata.
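
As a simple illustration, here is a hedged sketch using boto3 against Amazon S3: the object is written with user-defined metadata that travels with it, and that metadata can be read back without downloading the payload. The bucket, key, and metadata values are assumptions chosen only for the example.

# A minimal sketch of object storage with user-defined metadata, using Amazon S3
# via boto3. Bucket, key, and metadata values are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

# Each object carries its own metadata; the user-defined entries below travel
# with the object and come back on retrieval.
with open("part-0001.parquet", "rb") as payload:
    s3.put_object(
        Bucket="example-archive-bucket",
        Key="events/2015/part-0001.parquet",
        Body=payload,
        Metadata={
            "source-system": "clickstream",
            "retention-class": "cold",
            "ingest-date": "2015-06-30",
        },
    )

# head_object returns the object's metadata without downloading the payload.
response = s3.head_object(
    Bucket="example-archive-bucket",
    Key="events/2015/part-0001.parquet",
)
print(response["Metadata"])  # {'source-system': 'clickstream', ...}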

Big data lifecycle management

Once you decide on the best platform to meet your archival storage needs, the next step is to implement a data lifecycle management strategy and a transparent, policy-based system. What are the rules, based on the age and/or usage of the data, that define when data moves from a block-based file system like HDFS into an archival platform like S3 or Glacier? How will you maintain the metadata so that you can still run queries on the archived data? And what is a reasonable timeframe for accessing the archived data? Retrieval may take longer, but that may be acceptable given the storage tiers you have defined.
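
One way to express such age-based rules, if your archival platform is S3, is through a bucket lifecycle policy. The sketch below, again using boto3, transitions objects under a prefix to Glacier after 90 days and expires them after roughly seven years; the bucket name and the thresholds are assumptions chosen purely for illustration, and your own policies would reflect your business rules.

# A minimal sketch of an age-based, policy-driven tier transition: objects under
# the "events/" prefix move to Glacier 90 days after creation and are expired
# after roughly seven years. Bucket name and thresholds are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-events",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 2555},  # roughly seven years
            }
        ]
    },
)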

Managing the lifecycle of data at the scale of “big” can be challenging. That’s why Zaloni offers Bedrock DLM, which gives enterprises the ability to create and automate global and specific data retention policies for data in the data lake based on whatever makes sense for the business, including age and relevancy. You can use Bedrock DLM to apply metadata, define storage tiers in Hadoop, delete old data and export data from HDFS to more cost-effective storage in the cloud.

Ready to consider an integrated, hybrid approach to storage for data in your data lake? Archival storage can be low-hanging fruit when it comes to cutting costs, so it's worthwhile to explore your options. And the good news is that there are tools to help you implement a sound, policy-based data lifecycle management strategy customized to meet your business needs.

If you’d like to listen to the O’Reilly Webinar where this topic was discussed, you can access the replay here.

About the Author

Ben Sharma

Ben Sharma is CEO and co-founder of Zaloni. He is a passionate technologist with experience in business development, solutions architecture, and service delivery of big data, analytics, and enterprise infrastructure solutions. He previously held management positions at NetApp, Fujitsu, and others, and his expertise ranges from business development to production deployment across a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. Ben is the co-author of Java in Telecommunications and holds two patents. He received his MS in Computer Science from the University of Texas at Dallas.
