You’ve heard it time and time again: cloud is the future; those who don’t adopt modern big data practices will fall behind the pack; the next wave of IT disruption is right around the corner. And yet, at the same time, budgets are shrinking, demand is growing and pressure on the IT organization to show value is at an all-time high. As an executive, you have the full force of your business behind you and more options than ever to achieve both short- and long-term goals with business data. So many options, in fact, that the landscape has become a confusing, often contradictory mess of competing solutions.
Hadoop is one of the most widely adopted next-generation big data frameworks, and it is also one of the most confusing. MapR, Cloudera or Hortonworks? Flume, Sqoop, Kafka or NiFi? Spark or MapReduce? Each offering in the Hadoop ecosystem has its strengths and weaknesses, but many are unrealistically sold as a silver bullet for an array of business problems. Likewise, the data lake architecture, although younger than Hadoop, holds great promise. However, this architecture can also confuse business leaders as it becomes more pervasive in the market. Choosing a cloud provider adds another level of uncertainty: AWS, Azure or Google Cloud? So, where do you start when you need concrete, proven big data solutions?
Some basics about Hadoop and data lakes
You may know nothing about data lakes or Hadoop, you might have heard of them in passing, or you might already be rolling out a pilot. This post is meant to serve equally as an introduction and a reminder of some of the strengths and basic uses of Hadoop and data lakes. To level-set, let me first define Hadoop and data lakes.
Hadoop often refers to the many interrelated big data software products created under the umbrella of the Apache Software Foundation. Hadoop has also come to refer to bundles of these products sold by third-party vendors such as MapR, Cloudera and Hortonworks, among many others.
Data lakes are architectures of (usually enterprise-level) data storage, management and governance. In this architecture, raw data is ingested into the “lake,” where it resides in an unaltered state until it is needed by the organization; it can then be processed, enriched and extracted without losing fidelity or metadata surrounding the raw data.
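To make the "store raw, process later" idea concrete, here is a minimal sketch in Python. It uses a local directory as a stand-in for the lake's storage layer, and the zone layout, the `ingest_raw` and `read_orders` functions, the sidecar metadata fields and the sample file name are all illustrative assumptions, not part of any particular product or of Hadoop itself. The point is only that the bytes are stored unaltered alongside their metadata, and a schema is applied when the data is read.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte in a raw zone, plus a sidecar metadata file."""
    raw_dir = lake_root / "raw" / source  # hypothetical zone layout
    raw_dir.mkdir(parents=True, exist_ok=True)
    raw_path = raw_dir / name
    raw_path.write_bytes(payload)  # no parsing, no normalization on the way in

    # Metadata about the raw data is kept next to it, so origin and integrity
    # can be checked later (field names are assumptions for this sketch).
    metadata = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }
    (raw_dir / (name + ".meta.json")).write_text(json.dumps(metadata, indent=2))
    return raw_path


def read_orders(raw_path: Path) -> list[dict]:
    """Apply a schema only at read time; the raw file itself is never modified."""
    lines = raw_path.read_text().splitlines()
    header = lines[0].split(",")
    return [dict(zip(header, line.split(","))) for line in lines[1:]]


def demo() -> tuple[bytes, bytes, list[dict]]:
    """Ingest a small made-up CSV, then read it back both raw and parsed."""
    payload = b"order_id,amount\n1,9.50\n2,20.00\n"
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = ingest_raw(Path(tmp), "web_orders", "2024-01-01.csv", payload)
        rows = read_orders(raw_path)
        stored = raw_path.read_bytes()
    return payload, stored, rows
```

In this sketch the stored bytes are identical to what arrived, which is what preserving fidelity means in practice; any enrichment would be written to a separate location rather than over the raw file.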
What Hadoop and data lakes are NOT
Before jumping into what Hadoop and data lakes can do, here’s what they can’t do.
Hadoop is not a drop-in replacement for traditional database systems
For everything that Hadoop is, it is not simple. It is radically different from traditional Oracle and IBM implementations and, although it is amazingly powerful, it is not one-size-fits-all. The nuances and subtleties would take a whole series of posts on their own, but for now understand that there is a place and function for both Hadoop and RDBMS in a cutting-edge IT organization, especially when highly transactional processes are common.
Data lakes are not a wholesale replacement for data warehouse architectures
As tempting as it may be to rip out all of your EDW architecture and transition to a data lake, this is equivalent to opening all of the floodgates at once before the dam is built. Like gushing water that floods its surroundings, this approach will flood data owners, IT managers and other end users, and not in a good way. A steady, carefully planned transition may or may not involve completely removing EDWs, even at full implementation. This is why, in a thoughtful data lake architecture such as in the diagram below, EDWs may still be present. The EDW portion may be significantly smaller in this architecture than in an EDW-only implementation, but it may never be completely eliminated.
Data lakes and Hadoop are not set-it-and-forget-it systems
For very different reasons, Hadoop and a data lake architecture require expert, hands-on management throughout their lifecycles.
- Hadoop is an ecosystem of Apache-managed open source projects. As such, it is constantly changing, evolving and shifting. Being abreast of the most current changes in each project is critical to long-term success.
- Data lakes, if left unmanaged, can quickly become messy and unmanageable, creating a lack of transparency into the processes and origins of data, and growing in size and complexity until they are no longer efficient or cost effective. This is where products such as Zaloni’s data lake management platform can be leveraged to effectively automate, manage and govern the data lake.
Now, the upside of Hadoop and data lakes
So we’ve reviewed some of the challenges with Hadoop and data lakes. They are all valid and are all characteristic of a cutting-edge, young technology—as you know, innovation is never without heightened risk. However, these two technologies have massive upside, especially given recent advances that help manage and mitigate the risks. Let’s take a look at some of the key strengths of both Hadoop and data lakes.
Hadoop is massively powerful, flexible and scalable to almost any business problem
Hadoop was originally designed precisely because legacy tools could not handle the volumes of data being produced at the onset of the modern digital age; since then, data has grown exponentially, but so has computing power. Being both open source and parallel-compute-driven, Hadoop is poised to continue leveraging advances in both software and hardware for years to come.
Distributed computing is cost effective and future proof
With Hadoop and a data lake, the bulk of your hardware is low-cost, highly redundant storage and servers. There’s no need for ultra-high-end servers that eat up swaths of budget before you even pay for the services to configure them, and that may need to be replaced in five years. Better still, as compute power and storage get cheaper, you can upgrade as much as you want, when you want.
The data lake leverages Hadoop to its full potential
Instead of rigid, lengthy workflows that have to be defined before you even receive data, the data lake allows you to store data in its pure form by leveraging the distributed, low-cost storage of Hadoop. The flexible nature of Hadoop also allows for easier ingestion than is possible in traditional systems, meaning the data lake can store more data without requiring excessive middleware to translate or normalize it. And, by providing a unified interface for on-demand data preparation, transformation and enrichment, self-service data preparation tools such as Zaloni's Platform offer more efficiency, convenience and insight to data scientists and other users than a traditional RDBMS can.
How to evaluate if you are ready for a data lake
So, should you jump into the lake and embrace these emerging technologies? Here are a few questions to ask as you evaluate your options.
Just how big is your data?
Are you dealing with terabytes, petabytes, exabytes or zettabytes? Perhaps more importantly, how much will that data grow in the next five years, and will the solution you choose today be able to scale as your business grows? If you are already dealing with, or expect to deal with, more than a few petabytes of data, a data lake might be the right choice.
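A rough way to answer the five-year question is a simple compound-growth projection. The sketch below assumes a constant annual growth rate, which real workloads rarely follow exactly, and the starting size, growth rate and threshold in the usage example are made-up numbers for illustration only. Sizes are in decimal terabytes (1 PB = 1,000 TB).

```python
def projected_size_tb(current_tb: float, annual_growth: float, years: int) -> float:
    """Project data volume assuming a constant annual growth rate (e.g. 0.4 = 40%)."""
    return current_tb * (1 + annual_growth) ** years


def years_until(current_tb: float, annual_growth: float, threshold_tb: float) -> int:
    """Count whole years until the projected volume reaches the threshold."""
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive for the volume to grow")
    years = 0
    size = current_tb
    while size < threshold_tb:
        size *= 1 + annual_growth
        years += 1
    return years


def demo() -> tuple[float, int]:
    """Hypothetical example: 500 TB today, growing 40% per year."""
    five_year_size = projected_size_tb(500, 0.4, 5)  # about 2,689 TB, or ~2.7 PB
    years_to_one_pb = years_until(500, 0.4, 1000)    # crosses 1 PB in year 3
    return five_year_size, years_to_one_pb
```

Under these assumed numbers, an organization holding 500 TB today would pass the petabyte mark within three years, so the decision is better made against the projected volume than the current one.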
What are your short-term and long-term goals?
Are expansion, growth and evolution of your data storage and processing capabilities a priority, or is more weight placed on “run-the-business” processes? What critical business processes may be rendered inefficient if your organization fails to keep up? And, of course, what are the current budget and forecast outlook? Data lake implementations range in size, complexity and cost, but in the long term they will almost always end up being simpler and cheaper to maintain for massive amounts of data than an RDBMS.
What is the most important service characteristic you provide to end-users?
Do your users expect 24x7 uptime and availability, or is data fidelity the most important part of your services? Are your applications and data mostly transactional, or is data flow largely limited to ingestion, with interaction occurring in a limited capacity? Although advances have been made in Hadoop that allow more flexibility in transactional applications, traditional RDBMSs still have an edge over Hadoop when it comes to these implementations. Fortunately, data lakes provide the flexibility to maintain EDWs and RDBMSs where needed and leverage the cost, scalability and redundancy benefits of Hadoop elsewhere.
How much risk does your organization tolerate?
If you’ve answered the previous questions and still believe that a data lake is applicable to your business, the final question concerns your organization’s risk tolerance. Data lakes can be a tricky business, even with expert guidance.
Your next steps
My first recommendation is to keep researching and reading about Hadoop and data lakes. Then, talk to some experts: people who have experience implementing and using data lakes. Also, look into what tools are available to make data lake implementation and management easier. Products such as Zaloni’s data lake management platform exist to help ensure that your first foray into the data lake is not your last, by simplifying management, automating ingestion and processing, reducing manual interaction, and providing proven consulting and data science guidance services.