Maintaining a lineage of data in your data lake is not just a “nice to have” feature. Many organizations from Finance to Healthcare face government regulations around not only privacy but also data lineage. In fact, “how” data traverses the data lake from ingestion through transformation is often a legal matter.
So, how do you gain access to this information? Consider your data lake management tools. Regardless of the data lake solutions that you use or evaluate, look for the following data preparation capabilities:
- Data tagging so that searching and sorting become easier
- Converting data formats to make executing queries against the data faster
- Executing complex workflows to integrate updated or changed data
Whenever you do any of these things, you need metadata that shows the lineage from a transformation perspective: What queries were run? When did they run? What files were generated? You need to create a lineage graph of all the transformations that happen to the data as it flows through the pipeline.
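To make the idea concrete, here is a minimal sketch of such a lineage graph in Python. It is an illustration only, not how any particular platform stores lineage: the `Transformation` and `LineageGraph` names, the file paths, and the query strings are all hypothetical. Each recorded step answers the three questions above (which query ran, when it ran, and which files it generated), and `upstream` walks the graph back to the original sources.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Transformation:
    """One pipeline step: the query that ran, when it ran, and the files involved."""
    query: str
    ran_at: datetime
    inputs: list[str]
    outputs: list[str]


@dataclass
class LineageGraph:
    """Hypothetical in-memory lineage store; a real platform would persist this."""
    steps: list[Transformation] = field(default_factory=list)

    def record(self, query, inputs, outputs, ran_at=None):
        """Log one transformation; defaults the run time to now (UTC)."""
        step = Transformation(
            query, ran_at or datetime.now(timezone.utc), list(inputs), list(outputs)
        )
        self.steps.append(step)
        return step

    def upstream(self, path):
        """Return every file that `path` was derived from, directly or indirectly."""
        produced_by = {out: step for step in self.steps for out in step.outputs}
        seen = set()
        stack = [path]
        while stack:
            step = produced_by.get(stack.pop())
            if step is None:
                continue  # a source file: nothing recorded produced it
            for src in step.inputs:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


# Example usage with made-up paths and queries.
graph = LineageGraph()
graph.record(
    "convert orders CSV to Parquet",
    inputs=["raw/orders.csv"],
    outputs=["staged/orders.parquet"],
)
graph.record(
    "join orders with customers",
    inputs=["staged/orders.parquet", "staged/customers.parquet"],
    outputs=["published/orders_enriched.parquet"],
)
sources = graph.upstream("published/orders_enriched.parquet")
```

Here `sources` contains the two staged files plus `raw/orders.csv`, which is the kind of answer an auditor typically asks for: where did this published data set come from?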
Other issues may arise as well. For example, the data coming from source systems may change. How do you reconcile that changed data with the original data sets you brought in? You should be able to maintain a time series of what happened to the data over a period of time.
Again, a data management platform will do all of this. Moreover, it will ensure that all necessary data preparation is complete before the data is published in the data lake for consumption.
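One simple way to picture that time series is an append-only history per record, sketched below in Python. This is an assumption-laden illustration rather than a description of any product: the `RecordHistory` class, the record key, and the sample values are hypothetical, and the sketch assumes changes arrive in time order. Changed data is appended instead of overwriting the original, so you can ask what a record looked like at any earlier point.

```python
from bisect import bisect_right
from datetime import datetime


class RecordHistory:
    """Hypothetical append-only store: keeps every version of every record."""

    def __init__(self):
        # key -> list of (changed_at, value), kept in time order
        self._versions = {}

    def apply(self, key, value, changed_at):
        """Append a change from the source system; return False if nothing changed."""
        versions = self._versions.setdefault(key, [])
        if versions and changed_at < versions[-1][0]:
            raise ValueError("this sketch assumes changes arrive in time order")
        if versions and versions[-1][1] == value:
            return False  # same value as the latest version: nothing to reconcile
        versions.append((changed_at, value))
        return True

    def as_of(self, key, when):
        """Return the value that was current at `when`, or None if not yet ingested."""
        versions = self._versions.get(key, [])
        idx = bisect_right([ts for ts, _ in versions], when)
        return versions[idx - 1][1] if idx else None


# Example usage with made-up data.
history = RecordHistory()
history.apply("customer-42", {"city": "Raleigh"}, datetime(2017, 1, 1))
history.apply("customer-42", {"city": "Durham"}, datetime(2017, 6, 1))

before_change = history.as_of("customer-42", datetime(2017, 3, 1))
after_change = history.as_of("customer-42", datetime(2017, 7, 1))
```

In this example `before_change` is the original Raleigh record and `after_change` is the updated Durham record, so both the ingested original and the later change remain available for reconciliation.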
The Zaloni Data Platform allows you to automatically orchestrate and manage the data preparation process from simple to complex so that when your users are ready to analyze the data, the data is available. The platform also provides a robust RESTful API layer to allow integration with other systems for tracking and reporting.
For more on data lineage and transformations, read the self-service data presentation article.
About the Author
Adam Diaz, big data & Hadoop thought leader