What is a Cloud Data Lake?

October 3, 2018 Aashish Majethia

Many organizations that we talk to are interested in using cloud infrastructure for their data lake. They’re smart to consider it. It’s a highly flexible deployment model in which you pay only for the compute and storage you use. For companies whose processing needs vary widely, this model can offer a significantly lower price point, and it shifts hardware management to a third party.

What is a data lake?

Contrary to what some organizations have been led to believe, a cloud-based data lake is not just an S3 bucket where data is dumped. A data lake is a maintainable, functioning infrastructure that enforces governance across all of its data. It gives the right people access at the appropriate stages of the data lifecycle and can follow a zone-based architecture tailored to an organization’s needs. A data lake should also provide self-service access to end users, reducing overhead on IT.
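To make the idea of zone-based access concrete, here is a minimal sketch in Python. The zone names (`raw`, `trusted`, `refined`) and the roles are hypothetical, chosen only for illustration; a real data lake would define its own zones and enforce them through its governance tooling rather than a lookup table.

```python
# Hypothetical zones and the roles allowed to read from each one.
# These names are assumptions for illustration, not a standard.
ZONE_READERS = {
    "raw": {"data_engineer"},
    "trusted": {"data_engineer", "data_steward"},
    "refined": {"data_engineer", "data_steward", "analyst"},
}


def can_read(role: str, zone: str) -> bool:
    """Return True if the role may read datasets stored in the zone."""
    # Unknown zones grant access to nobody.
    return role in ZONE_READERS.get(zone, set())


def readable_zones(role: str) -> list[str]:
    """List the zones a role may read, in the order they are defined."""
    return [zone for zone, roles in ZONE_READERS.items() if role in roles]


if __name__ == "__main__":
    for role in ("data_engineer", "data_steward", "analyst"):
        print(role, "can read:", readable_zones(role))
```

In this sketch an analyst gets self-service access to the refined zone only, while raw data stays restricted to the engineers who ingest it.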

Benefits of running in the cloud

Cloud providers have developed a wide range of services and tools that organizations can combine in multiple ways. This gives cloud subscribers many building blocks for their infrastructure, and the cost of trying (and potentially failing with) several candidate options is minimal.

An organization can build on the cloud providers’ tools to develop a fully functioning data lake. It can start small and scale out if necessary. In short, the cloud provides a scalable architecture with low upfront cost. As the organization’s needs increase, it can scale its compute, storage, and application requests.

In a cloud-based infrastructure, an organization pays only for what it uses. For example, an organization with high compute needs that occur only in short bursts is an ideal candidate for savings.
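The arithmetic behind the bursty-workload case can be sketched as follows. All figures (node count, hours, hourly rate) are hypothetical, and the comparison assumes the same per-node rate for an always-on cluster, which is a simplification; storage, data transfer, and on-premises capital costs are ignored.

```python
def compute_cost(hours: float, nodes: int, rate_per_node_hour: float) -> float:
    """Cost of running `nodes` machines for `hours` at a flat hourly rate."""
    return hours * nodes * rate_per_node_hour


# Hypothetical workload: 20 nodes needed for 4 hours a day over 30 days.
NODES = 20
RATE = 0.50               # assumed price per node-hour
BURST_HOURS = 4 * 30      # hours the cluster is actually busy
PERIOD_HOURS = 24 * 30    # hours in the billing period

pay_per_use = compute_cost(BURST_HOURS, NODES, RATE)
always_on = compute_cost(PERIOD_HOURS, NODES, RATE)
savings_fraction = 1 - pay_per_use / always_on

if __name__ == "__main__":
    print(f"pay-per-use: {pay_per_use:.2f}")
    print(f"always-on:   {always_on:.2f}")
    print(f"savings:     {savings_fraction:.0%}")
```

Under these assumptions the cluster is busy for one sixth of the period, so paying only for the bursts costs one sixth as much as keeping the same capacity running continuously.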

Downside of the Cloud

Although cloud vendors take on much of the risk associated with data storage and security, those risks are still very real, and the choice of cloud vendor needs to take them into account. Data access pipelines need to be considered as well: can the customer send and receive data at the speed necessary?
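A rough way to answer the transfer-speed question is to estimate how long a given volume of data takes to move over a given link. The sketch below uses hypothetical numbers and an assumed `efficiency` factor to account for protocol overhead and contention; real throughput depends on the network path and the provider.

```python
def transfer_hours(size_tb: float, bandwidth_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to move `size_tb` terabytes over a link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits per second).
    `efficiency` is the assumed fraction of the link actually usable (0 to 1].
    """
    if bandwidth_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    bits = size_tb * 10**12 * 8
    seconds = bits / (bandwidth_gbps * 10**9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: 10 TB over a 1 Gbps link at full utilization.
    print(f"{transfer_hours(10, 1):.1f} hours")
```

If the estimate exceeds the window in which the data must arrive, the pipeline needs a faster link, incremental loads, or a different transfer method.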

An oft-forgotten issue is the risk associated with choosing only one cloud vendor. If that vendor suddenly decides to increase its prices by 20%, it can wreak havoc on an organization’s IT budget. Many organizations are now recognizing vendor risk and are seeking solutions that provide multi-cloud support to reduce it.

An Ideal Option

Zaloni’s Data Platform orchestrates the ingestion, transformations, tokenization and masking of sensitive/PII data, and provisioning to databases. Zaloni’s data lake management system provides an abstraction layer leveraging the native compute and storage of underlying infrastructure. Thanks to its flexible architecture, it can natively work with multiple cloud (or on-premises) infrastructures.
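For readers unfamiliar with the terms, the sketch below illustrates what tokenization and masking of a sensitive field can look like in general. It is not Zaloni’s implementation or API; the function names, the keyed-hash approach to tokenization, and the sample value are all assumptions made for illustration.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes) -> str:
    """Replace a value with a deterministic token using a keyed hash (HMAC-SHA256).

    The same input and key always give the same token, so joins still work,
    but the original value cannot be read from the token.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide all but the last `visible` characters of a value."""
    if visible <= 0 or len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


if __name__ == "__main__":
    secret_key = b"hypothetical-key"   # in practice, load from a secrets manager
    sample = "123-45-6789"             # made-up identifier, not real data
    print(mask(sample))
    print(tokenize(sample, secret_key))
```

Masking keeps a value partly readable for display, while tokenization swaps it for a stand-in that downstream systems can still use as a key.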

As of this writing, it is the only data lake management solution that can provide a layer on top of a multi-cloud environment. This is a key win for companies that have made risk minimization a goal.

Learn more about cloud data lakes and the architecture needed to realize them in the webinar “Building a Governed Data Lake in the Cloud,” presented by Rajesh Nadipalli.

About the Author

Aashish Majethia

Sales Engineer
