Many enterprises are feeling let down by the promise of big data and data lakes because they’re not seeing the return on investment they expected. Their original approach to data lake management wasn't scalable, so a new approach is required to help them derive significant value from their data. Specifically, these enterprises need a way to effectively manage and govern all of their data and make it readily available to users at various levels, not just data scientists.
Enter the managed data lake
We think this new approach involves managed data lakes. In fact, we predict the future enterprise data ecosystem will have a managed data lake at its core. The data lake will be fed by multiple structured data sources, real-time data streams and unstructured data. All of the data will be stored in this central repository, whether in the cloud, on premises or in some combination, where it can be transformed, cleaned and manipulated by data scientists, data analysts and general business users. Prepared datasets can then be fed back into the data warehouse for business intelligence, or into other tools for visualization, data science, data discovery, analytics, predictive modeling and reporting.
Metadata and data governance
Many early adopters used a Hadoop data lake as a relatively inexpensive storage solution and dumped data into it without much of a plan, expecting they would figure it out when they needed to use the data. The problem is that there is simply too much data, of varying quality, in too many formats. To keep track of it all and enable data governance, data must be managed upon ingestion. This is achieved by layering a data management platform on top of your data lake that applies metadata and defines, tracks and logs every step of ingestion into the data lake. A data lake management platform provides the essential data visibility, reliability, security and privacy controls, gives you an understanding of data quality, and can allow broader access to data by multiple users.
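To make "managed upon ingestion" more concrete, here is a minimal sketch in Python of what recording metadata at ingestion time might look like. It is an illustration only: the `Catalog` and `CatalogEntry` names, the fields captured and the single quality flag are assumptions made for this example, not the API of any particular data lake management platform.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata recorded for one dataset at ingestion (hypothetical schema)."""
    name: str
    source: str
    data_format: str
    size_bytes: int
    sha256: str
    ingested_at: str
    quality_flags: list = field(default_factory=list)


class Catalog:
    """In-memory stand-in for a management platform's metadata store."""

    def __init__(self):
        self.entries = {}    # dataset name -> CatalogEntry
        self.audit_log = []  # append-only record of ingestion events

    def ingest(self, name, source, data_format, payload: bytes):
        # A trivial quality check; a real platform would apply many more.
        flags = []
        if not payload:
            flags.append("empty_payload")

        entry = CatalogEntry(
            name=name,
            source=source,
            data_format=data_format,
            size_bytes=len(payload),
            sha256=hashlib.sha256(payload).hexdigest(),
            ingested_at=datetime.now(timezone.utc).isoformat(),
            quality_flags=flags,
        )
        # Register the dataset and log the step so it can be traced later.
        self.entries[name] = entry
        self.audit_log.append({"event": "ingest", **asdict(entry)})
        return entry


catalog = Catalog()
entry = catalog.ingest("orders_2024", "crm_export", "csv", b"id,amount\n1,9.99\n")
print(json.dumps(asdict(entry), indent=2))
```

The point of the sketch is the ordering: the catalog entry and the audit record are written in the same step that accepts the data, so nothing lands in the lake without a traceable description of what it is and where it came from.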
Common challenges to consider
Transitioning to a data lake architecture is difficult, particularly for enterprises that have multiple legacy data platforms and applications. Even among enterprises that were early adopters of the data lake strategy, many still struggle with issues like data quality, visibility, security, privacy and governance. To avoid some of these issues, it’s good to be aware of some of the most common potential challenges so that you can tackle them proactively. The following are challenges we regularly see enterprises facing:
- Ensuring a successful deployment. Your proof of concept (POC) may have gone well, but now you need to operationalize your data lake for new use cases and integrate it into daily business practices across the enterprise.
- Navigating the complexity. You have to contend with an ecosystem that includes hardware, software and applications. Hadoop requires you to integrate multiple tools to successfully build a managed data lake.
- Solving the skills gap. Implementing a managed data lake requires a specific skill set—one that many development and architecture professionals may not have, making talent hard to find and costly to hire. A Gartner survey revealed that 57% of enterprises say they are not ready to adopt Hadoop because of skills gaps.
- Staying relevant. Hadoop is a relatively new technology and its ecosystem is constantly changing as the community develops new tools and solutions to increase data availability, and make data processing and analysis faster, storage more efficient and programming simpler.
- Addressing technology and business challenges, including governance. Without a systematic and automated way to manage and govern data, you won’t know what data you have, and you won’t be able to trust your data quality, provide access to data for multiple users, or comply with security and privacy regulations.
Your next steps
Once you determine your strategy and are ready to deploy a managed data lake, know that there are tools that can help simplify processes along the entire data pipeline, including ingestion, data management, orchestration of workflows and self-service data preparation. Don’t wait. Companies that are able to leverage their data have a significant competitive advantage over those that can’t. The time to start exploring how to integrate a managed data lake into your architecture is now.
About the Author
Tony Fisher has been in the data management business for over 25 years, most recently as the Chief Technology Officer for the Progress Software data management business. Previously, he was President and Chief Executive Officer of DataFlux Corporation, a SAS Company. Tony is a sought-after speaker and author.