Zaloni Zip: Data Lineage

December 2, 2016 Adam Diaz

Maintaining a lineage of data in your data lake is not just a “nice to have” feature. Many organizations, from finance to healthcare, face government regulations covering not only privacy but also data lineage. In fact, “how” data traverses the data lake from ingestion through transformation is often a legal matter.



So, how do you gain access to this information? Consider your data lake management tools. Regardless of the data lake solutions that you use or evaluate, look for the following data preparation capabilities:

  • Data tagging so that searching and sorting becomes easier
  • Converting data formats to make executing queries against the data faster
  • Executing complex workflows to integrate updated or changed data

Whenever you do any of these things, you need metadata that shows the lineage from a transformation perspective: What queries were run? When did they run? What files were generated? You need to create a lineage graph of all the transformations that happen to the data as it flows through the pipeline.
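As an illustration only, the three questions above (what ran, when it ran, what it generated) can be captured in a small lineage graph. The sketch below is a minimal in-memory version in Python; the `Transformation` and `LineageGraph` names, the file paths, and the query descriptions are all hypothetical and are not taken from any particular data lake product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Transformation:
    """One pipeline step: what ran, when it ran, and which files it touched."""
    query: str
    ran_at: datetime
    inputs: list[str]
    outputs: list[str]


@dataclass
class LineageGraph:
    # Maps each generated file to the transformation that produced it.
    produced_by: dict[str, Transformation] = field(default_factory=dict)

    def record(self, step: Transformation) -> None:
        """Register a step as the producer of each of its output files."""
        for path in step.outputs:
            self.produced_by[path] = step

    def upstream(self, path: str) -> list[Transformation]:
        """Return every recorded step that contributed to `path`,
        starting with the step that produced it directly."""
        seen, order, stack = set(), [], [path]
        while stack:
            step = self.produced_by.get(stack.pop())
            if step is None or id(step) in seen:
                continue  # a source file, or a step already visited
            seen.add(id(step))
            order.append(step)
            stack.extend(step.inputs)
        return order


# Hypothetical two-step pipeline: format conversion, then aggregation.
graph = LineageGraph()
graph.record(Transformation(
    query="convert CSV to ORC",
    ran_at=datetime(2016, 12, 1, 8, 0, tzinfo=timezone.utc),
    inputs=["raw/orders.csv"],
    outputs=["staged/orders.orc"],
))
graph.record(Transformation(
    query="aggregate orders by day",
    ran_at=datetime(2016, 12, 1, 9, 0, tzinfo=timezone.utc),
    inputs=["staged/orders.orc"],
    outputs=["published/daily_orders.orc"],
))

for step in graph.upstream("published/daily_orders.orc"):
    print(step.ran_at.isoformat(), step.query, "->", step.outputs)
```

Walking upstream from a published file answers an auditor's question directly: which queries, in which order, turned the ingested data into what users see. A real platform would persist this metadata rather than hold it in memory.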

Other issues may also arise: the data coming from source systems may change. How do you reconcile that changed data with the original data sets you brought in? You should be able to maintain a time series of what happens to the data over a period of time.
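One way to picture such a time series is to keep every version of a record instead of overwriting it. The following Python sketch assumes each record has a stable key and that each load carries a timestamp; the `reconcile` and `as_of` functions and the sample customer data are hypothetical, meant only to show the idea.

```python
from datetime import datetime, timezone


def reconcile(history, batch, loaded_at):
    """Append a new version for every record in `batch` that is new or changed.

    `history` maps a record key to a list of (loaded_at, values) versions,
    oldest first, so earlier versions are never overwritten.
    Returns the keys that were added or changed by this load.
    """
    changed = []
    for key, values in batch.items():
        versions = history.setdefault(key, [])
        if not versions or versions[-1][1] != values:
            versions.append((loaded_at, values))
            changed.append(key)
    return changed


def as_of(history, key, moment):
    """Return the values of `key` as they stood at `moment`, or None."""
    current = None
    for loaded_at, values in history.get(key, []):
        if loaded_at <= moment:
            current = values
    return current


# Hypothetical loads from a source system on two consecutive days.
history = {}
day1 = datetime(2016, 12, 1, tzinfo=timezone.utc)
day2 = datetime(2016, 12, 2, tzinfo=timezone.utc)

reconcile(history, {"c1": {"city": "Raleigh"}, "c2": {"city": "Durham"}}, day1)
changed = reconcile(history, {"c1": {"city": "Cary"}, "c2": {"city": "Durham"}}, day2)

print(changed)                      # keys whose values differ from the prior load
print(as_of(history, "c1", day1))   # the record as originally brought in
print(as_of(history, "c1", day2))   # the record after the source change
```

Because old versions are retained, the original data set can still be reproduced for any past date, which is the property a regulator or auditor typically asks for.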

Again, a data management platform will do all of this. Moreover, it will ensure that all necessary data preparation is complete before the data is published in the data lake for consumption.

The Zaloni Bedrock platform allows you to automatically orchestrate and manage the data preparation process from simple to complex so that when your users are ready to analyze the data, the data is available. The platform also provides a robust RESTful API layer to allow integration with other systems for tracking and reporting.

For more on data lineage and transformations, read the self-service data presentation article.


About the Author

Adam Diaz

Director of Field Engineering Sales - RTP Raleigh NC

