Data ingestion is about much more than getting your data into the data lake. Think about it this way: designing your ingestion process is like setting up a digital “factory,” with inputs and expected outputs. That’s why when the data comes in, you have to be able to monitor whether your factory is delivering outputs reliably and consistently. You need to be able to direct the data to the right place as it is ingested, and move it along the “assembly line.” Also, you need to know in real time when something breaks down and diagnose it accurately, so that you can get your widget-making processes up and running again, fast.
A major reason why some data lakes fail is an inability to manage their complexity. A typical data lake can comprise more than 20 different technologies, using different languages, managing different data types, and supporting various business purposes. For most of us, it’s not possible to have expertise this broad and deep. And even if you have such a genius on your team, what happens when that person goes on vacation, or leaves your company? Creating a data ingestion “factory” that is transparent and manageable by mere mortals on your team is in everyone’s best interest.
What does a data ingestion factory look like? It’s important to think beyond bringing in the data and develop a sound strategy to create repeatable and consistent processes to manage the entire lifecycle of your data.
Tip #1: Automate everything
Many businesses use a company-wide operations management or scheduling tool for data lake ingestion. For Hadoop-based data lakes, many administrators use a combination of command-line technologies, such as shell and Python scripts, along with specific Hadoop technologies for automation. These solutions provide some level of automation, but although they can launch the ingestion process, they generally cannot manage it and do not assist in root cause analysis when something fails. We recently spoke to one company where it took three days to figure out what went wrong because their homegrown tool kept them in the dark. Three days is an eternity for companies dependent on real-time data for business purposes.
Ideally, your data lake management platform would not only automate ingestion into the lake but would also automate metadata application, data quality checks, data transformations and data lifecycle management, and make root cause analysis easy for your team. It’s important to use a tool with an intuitive interface that allows you to manage your data pipelines in a reliable, understandable way.
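To make the idea concrete, here is a minimal Python sketch of a pipeline that runs ingestion steps in order and records exactly which step failed and why. It is an illustration, not how any particular platform works: the Pipeline class, the step functions and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    name: str
    ok: bool
    error: Optional[str] = None


@dataclass
class Pipeline:
    """Hypothetical runner: executes named steps in order and records where a run stopped."""
    steps: List[tuple] = field(default_factory=list)

    def add(self, name: str, fn: Callable[[list], list]) -> "Pipeline":
        self.steps.append((name, fn))
        return self

    def run(self, records: list) -> List[StepResult]:
        results = []
        for name, fn in self.steps:
            try:
                records = fn(records)
                results.append(StepResult(name, True))
            except Exception as exc:
                # Stop at the first failure and keep the step name and message
                results.append(StepResult(name, False, f"{type(exc).__name__}: {exc}"))
                break
        return results


# Hypothetical steps mirroring the stages named above
def apply_metadata(records):
    return [{**r, "source": "crm"} for r in records]


def quality_check(records):
    bad = [r for r in records if r.get("id") is None]
    if bad:
        raise ValueError(f"{len(bad)} record(s) missing 'id'")
    return records


def transform(records):
    return [{**r, "name": r["name"].strip().title()} for r in records]


pipeline = (
    Pipeline()
    .add("apply_metadata", apply_metadata)
    .add("quality_check", quality_check)
    .add("transform", transform)
)

results = pipeline.run([{"id": 1, "name": " ada "}, {"id": None, "name": "bob"}])
failed = next((r for r in results if not r.ok), None)
if failed:
    print(f"Pipeline stopped at '{failed.name}': {failed.error}")
```

Because every step is named and every failure is captured with its step, the answer to "what went wrong?" is a lookup rather than a three-day investigation.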
Tip #2: Decide on your business - build vs. buy
The three-day root cause investigation mentioned above shows the value of having a root cause capability in your chosen solution, and it also raises the question of build vs. buy in software choice. Most people are not in the business of building software. The issue is clouded in IT especially, because many IT teams have solved complex problems in the past with programs they wrote as a group. For most groups, though, writing software is not their core business. Highly technical groups can make good attempts at professional software design, but most eventually face the reality of the long-term cost of such projects. The distraction from the core business, along with the lack of budget for the general infrastructure required to produce quality software, generally wins the day. This is no different from an auto mechanic. Mechanics could produce their own tools, and sometimes do, but by and large they buy tools (or parts) rather than manufacture them, even though they probably have the mechanical aptitude to do so. Their core business is repair, so why not focus on that mission? Let the tool makers provide you with high-quality tools that make your business goals easier to reach.
Tip #3: Scale it up
The reason you have a data lake is that you need to be able to ingest and store any volume of data, even during a spike. How will you design your ingestion process so it can scale with a spike in data from multiple data streams, and handle any number of various data types? Managing data at scale means a design without bottlenecks. You need an elastic and scalable ingestion platform and a comprehensive strategy for where the data will land, where and how it will be stored and for how long. We recommend setting up landing, raw, trusted, refined and sandbox zones in your data lake. Developing data management and retention policies will then allow you to automate movement of data between those zones, as well as from the data lake to less expensive storage options (or at least to a location that makes sense for your use case).
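A retention policy like this can be written down as data and evaluated automatically. The sketch below is a minimal Python illustration: the zone names come from the recommendation above, but the retention windows and the choice of which zones archive rather than delete are made-up assumptions that you would replace with values for your own use case.

```python
from datetime import date

# Hypothetical retention windows in days; real values depend on your use case.
RETENTION_DAYS = {
    "landing": 7,
    "raw": 365,
    "trusted": 730,
    "refined": 730,
    "sandbox": 30,
}

# Assumption: expired data in these zones moves to cheaper storage instead of being deleted.
ARCHIVE_ZONES = {"raw", "trusted", "refined"}


def retention_action(zone: str, created: date, today: date) -> str:
    """Return 'keep', 'archive' or 'delete' for a dataset in the given zone."""
    age_days = (today - created).days
    if age_days <= RETENTION_DAYS[zone]:
        return "keep"
    return "archive" if zone in ARCHIVE_ZONES else "delete"


today = date(2024, 1, 31)
print(retention_action("landing", date(2024, 1, 1), today))  # 30 days old, window is 7
print(retention_action("raw", date(2022, 1, 1), today))      # older than one year
```

Keeping the policy in one table means a scheduled job can apply it to every dataset without anyone deciding case by case.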
Tip #4: Plan for variety
If you’re not at the very least making plans to capture and use streaming data, you’re definitely behind the curve. For some industries, such as automotive or telecommunications, massive volumes of streaming data are critical to the core business model. For other less-technical industries, such as retail, streaming is becoming vital for analyzing customer sentiment data that informs marketing and loyalty programs. When designing your ingestion process, it’s important to think about all of the different types of data, e.g., streaming data versus batch or files that are coming in from various sources, including databases and applications.
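One way to plan for variety is to give each kind of incoming data its own handler and route sources through a single entry point. The following Python sketch assumes three kinds (stream, batch and file); the handler names, the source dictionaries and the landing paths are hypothetical and only illustrate the routing idea.

```python
from typing import Callable, Dict

# Registry mapping a source kind to the function that ingests it
HANDLERS: Dict[str, Callable[[dict], str]] = {}


def handler(kind: str):
    """Decorator that registers an ingestion function for one kind of source."""
    def register(fn):
        HANDLERS[kind] = fn
        return fn
    return register


@handler("stream")
def ingest_stream(source: dict) -> str:
    # Placeholder: a real handler would subscribe to the stream
    return f"landing/stream/{source['name']}"


@handler("batch")
def ingest_batch(source: dict) -> str:
    # Placeholder: a real handler would run a scheduled extract from a database
    return f"landing/batch/{source['name']}"


@handler("file")
def ingest_file(source: dict) -> str:
    # Placeholder: a real handler would copy the file into the lake
    return f"landing/files/{source['name']}"


def route(source: dict) -> str:
    """Send a source to its handler and return the (hypothetical) landing path."""
    kind = source["kind"]
    if kind not in HANDLERS:
        raise ValueError(f"no ingestion handler for kind {kind!r}")
    return HANDLERS[kind](source)


print(route({"kind": "stream", "name": "clickstream"}))
print(route({"kind": "batch", "name": "orders_db"}))
```

Adding a new type of data then means registering one more handler, while sources of an unknown kind fail loudly instead of being silently dropped.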
Data lake ingestion isn’t as simple as it seems at first glance; in fact, it’s key to the overall success of your big data strategy. That strategy might include the use of cloud-based Hadoop clusters, existing enterprise data warehouses (EDWs) and transient clusters to handle activity spikes (or simply to keep OpEx low).
By putting a sound strategy and the right ingestion platform in place to enable automation, scalability/elasticity and reliable ingestion of any data type, you are well on your way to creating a useful, valuable data lake.
About the Author
Adam Diaz, big data & Hadoop thought leader