Data Lake Archiving: Hadoop or the Cloud?

December 8, 2016 | Ben Sharma

The storage layer of the data lake is evolving. A few years ago, when we talked about the data lake, it was generally understood that Hadoop was the underlying platform for everything related to the data lake. Today, thanks to the cloud, that is no longer the case. Deploying a data lake in the cloud enables you to decouple storage and compute functions and use the storage platform that is best suited and most cost-effective for your needs – which may not be Hadoop.

If you're considering a strategy for migrating storage to the cloud within your data lake architecture, archival storage is a key area to target for lowering costs. But it’s important to do it in an automated and process-driven way that allows for transparency and scalability – not to mention access, as more enterprises become interested in using historical data for data analytics.

What are your options today to integrate a more cost-effective archival solution within the data lake architecture? How do you structure your data lifecycle management policies for the data lake to accommodate archiving?

Integrating a non-HDFS archival solution


You don't have to ingest data into HDFS to make it available to Hadoop applications, as long as the storage system exposes an HDFS-compatible interface. Therefore, to help save on costs, we typically advocate moving archival storage out of Hadoop and into the cloud. There you can take advantage of a non-block-based file system, such as an object store like Amazon S3, or, when speedy retrieval isn't a requirement, lower-cost long-term storage for cold data like Amazon Glacier.
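For example, a Spark job can read archived data straight out of S3 through the s3a:// connector, side by side with data that still lives in HDFS. The minimal PySpark sketch below assumes Spark 2.x with the hadoop-aws S3A connector on the classpath and AWS credentials already configured; the bucket and paths are placeholders, not real locations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-archived-data").getOrCreate()

# Hot data that still lives in HDFS
hot = spark.read.parquet("hdfs:///data/events/current/")

# Colder data archived to S3; the s3a:// connector lets the same DataFrame
# API read it without re-ingesting anything into HDFS (hypothetical bucket)
cold = spark.read.parquet("s3a://my-archive-bucket/events/2016/")

# Queries can span both tiers transparently
print(hot.union(cold).count())

Because both reads go through the same Hadoop FileSystem abstraction, application code doesn't have to change when data moves between tiers.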

Block vs. object storage

Block storage has long been the norm: traditional file systems break files down into blocks before storing them. Individual blocks carry no metadata of their own; only when they are accessed and combined with the other blocks do they form a file we can recognize and work with.

Object storage is a bit different. Each file is bundled with its metadata as a single “object,” named with an object ID and stored in a flat structure. You retrieve the whole object via its object ID. This enables fast access at scale. Notably, for data analytics, the metadata associated with the object is unlimited in terms of type and amount, and can be customized by users. Object storage provides a simple way to manage archival storage across locations, while providing an array of rich metadata.
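As a concrete illustration, here is a small boto3 sketch of storing and then retrieving an object along with user-defined metadata in S3. The bucket, key, and metadata values are hypothetical, and in S3 the object key plays the role of the object ID.

import boto3

s3 = boto3.client("s3")

# Store a file as an object, bundled with custom, user-defined metadata
s3.put_object(
    Bucket="my-archive-bucket",                         # hypothetical bucket
    Key="archive/events/2016/01/part-0000.parquet",     # acts as the object ID
    Body=open("part-0000.parquet", "rb"),
    Metadata={
        "source-system": "clickstream",
        "ingest-date": "2016-01-15",
        "retention-class": "cold",
    },
)

# Retrieve the object's metadata by its key, without downloading the data
response = s3.head_object(
    Bucket="my-archive-bucket",
    Key="archive/events/2016/01/part-0000.parquet",
)
print(response["Metadata"])   # {'source-system': 'clickstream', ...}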

Big data lifecycle management

Once you decide on the best platform to meet your archival storage needs, consider how you would implement a data lifecycle management strategy and a transparent, policy-based system. What rules, based on the age and/or usage of the data, define when data moves from a block-based file system like HDFS into an archival platform like S3 or Glacier? How will you maintain the metadata so that you can still run queries on the archived data? And what is a reasonable timeframe for accessing the data? Retrieval from an archival tier may take longer, but that may be acceptable based on the layers you have defined.
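Part of such a policy can be expressed directly in the storage platform itself. The boto3 sketch below is a minimal, illustrative example of an age-based S3 lifecycle rule that transitions objects under an archive prefix to Glacier after 90 days and expires them after roughly seven years; the bucket name, prefix, and thresholds are assumptions chosen for illustration, not recommendations.

import boto3

s3 = boto3.client("s3")

# Age-based tiering rule: objects under "archive/" move to Glacier after
# 90 days and are deleted after ~7 years (all values are illustrative)
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cold-data-tiering",
                "Filter": {"Prefix": "archive/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 2555},
            }
        ]
    },
)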

Managing the lifecycle of data at "big data" scale can be challenging. That's why Zaloni offers a data lifecycle management (DLM) capability, which gives enterprises the ability to create and automate global and specific data retention policies for data in the data lake based on whatever makes sense for the business, including age and relevancy. You can use our DLM capability to apply metadata, define storage tiers in Hadoop, delete old data, and export data from HDFS to more cost-effective storage in the cloud.
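To make the export step concrete (without describing how Zaloni's DLM implements it, which is beyond this post), here is a generic, hedged sketch of an age-based export: list files in HDFS, keep those older than a retention threshold, and hand each one to distcp for copying into S3. The paths, bucket, and threshold are assumptions, and a production tool would batch the copies and handle deletion and metadata updates as part of the policy.

import subprocess
from datetime import datetime, timedelta

RETENTION = timedelta(days=365)                     # assumed retention threshold
SRC_ROOT = "/data/events"                           # hypothetical HDFS path
DST_ROOT = "s3a://my-archive-bucket/events"         # hypothetical S3 bucket

def old_hdfs_files(root, cutoff):
    """Yield HDFS file paths whose modification time is older than cutoff."""
    listing = subprocess.check_output(["hdfs", "dfs", "-ls", "-R", root], text=True)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 8 or line.startswith("d"):  # skip headers and directories
            continue
        mtime = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
        if mtime < cutoff:
            yield parts[7]

cutoff = datetime.now() - RETENTION
for path in old_hdfs_files(SRC_ROOT, cutoff):
    dest = path.replace(SRC_ROOT, DST_ROOT, 1)
    # One distcp per file keeps the sketch simple; a real job would batch these
    subprocess.check_call(["hadoop", "distcp", path, dest])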

Ready to consider an integrated, hybrid approach to storage for the data in your data lake? Archival storage is low-hanging fruit when it comes to cutting costs, so it's worthwhile to explore your options. And the good news is that there are tools to help you implement a sound, policy-based data lifecycle management strategy that is customized to meet your business needs.

If you’d like to listen to the webinar where this topic originated, you can access the replay here.

 


 

About the Author

Ben Sharma

Ben Sharma is CEO and co-founder of Zaloni. He is a passionate technologist with experience in business development, solutions architecture, and service delivery of big data, analytics, and enterprise infrastructure solutions. He previously held management positions at NetApp, Fujitsu, and others, and his expertise ranges from business development to production deployment across a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. Ben is the co-author of Java in Telecommunications and holds two patents. He received his MS in Computer Science from the University of Texas at Dallas.
