Zaloni Zip: Data Lineage

December 2, 2016 Adam Diaz

Maintaining a lineage of data in your data lake is not just a “nice to have” feature. Organizations in industries from finance to healthcare face government regulations that cover not only privacy but also data lineage. In fact, “how” data traverses the data lake from ingestion through transformation is often a legal matter.



So, how do you gain access to this information? Consider your data lake management tools. Regardless of the data lake solutions that you use or evaluate, look for the following data preparation capabilities:

  • Data tagging so that searching and sorting become easier
  • Converting data formats to make executing queries against the data faster
  • Executing complex workflows to integrate updated or changed data

Whenever you do any of these things, you need metadata that shows the lineage from a transformation perspective: What queries were run? When did they run? What files were generated? You need to create a lineage graph of all the transformations that happen to the data as it flows through the pipeline.
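As a rough illustration of those three questions (what ran, when it ran, what it generated), here is a minimal Python sketch of a lineage graph. It is not how any particular platform stores lineage; the `Transformation` and `LineageGraph` names, the file paths, and the queries are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Transformation:
    """One pipeline step: what ran, when it ran, and which files it touched."""
    query: str
    ran_at: datetime
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


@dataclass
class LineageGraph:
    # Maps each generated file to the transformation that produced it.
    produced_by: dict[str, Transformation] = field(default_factory=dict)

    def record(self, step: Transformation) -> None:
        for path in step.outputs:
            self.produced_by[path] = step

    def upstream(self, path: str) -> list[Transformation]:
        """Return every transformation that contributed to `path`, nearest first."""
        steps, seen, queue = [], set(), [path]
        while queue:
            current = queue.pop(0)
            step = self.produced_by.get(current)
            if step is None or step in seen:
                continue  # a raw source file, or a step already visited
            seen.add(step)
            steps.append(step)
            queue.extend(step.inputs)
        return steps


# Hypothetical two-step pipeline: format conversion, then an aggregation query.
graph = LineageGraph()
graph.record(Transformation(
    query="convert raw CSV to Parquet",
    ran_at=datetime(2016, 12, 1, 8, 0, tzinfo=timezone.utc),
    inputs=("raw/claims.csv",),
    outputs=("staged/claims.parquet",),
))
graph.record(Transformation(
    query="SELECT member_id, SUM(amount) FROM claims GROUP BY member_id",
    ran_at=datetime(2016, 12, 1, 9, 30, tzinfo=timezone.utc),
    inputs=("staged/claims.parquet",),
    outputs=("published/claims_by_member.parquet",),
))

for step in graph.upstream("published/claims_by_member.parquet"):
    print(step.ran_at.isoformat(), step.query, "->", ", ".join(step.outputs))
```

Walking the graph backward from a published file answers the audit question directly: every query and intermediate file that contributed to it, with timestamps.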

Other issues may arise as well. For example, the data coming from source systems may change. How do you reconcile that changed data with the original data sets you brought in? You should be able to maintain a time series of what happens to the data over a period of time.
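One way to picture such a time series is an append-only history in which each change from the source system is stored next to the original value rather than overwriting it. The Python sketch below assumes a simple in-memory store; `VersionedDataset`, the record keys, and the dates are hypothetical and only illustrate the idea.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


class VersionedDataset:
    """Append-only history: each change is kept alongside the original value."""

    def __init__(self):
        # record key -> list of (effective_date, value), kept sorted by date
        self._history = defaultdict(list)

    def apply(self, key, effective, value):
        """Store a new version of `key` without discarding earlier ones."""
        versions = self._history[key]
        versions.append((effective, value))
        versions.sort(key=lambda version: version[0])

    def as_of(self, key, when):
        """Return the value that was current for `key` on date `when`, or None."""
        versions = self._history.get(key, [])
        dates = [effective for effective, _ in versions]
        index = bisect_right(dates, when)
        return versions[index - 1][1] if index else None

    def history(self, key):
        """Return the full time series of versions for `key`."""
        return list(self._history.get(key, []))


accounts = VersionedDataset()
accounts.apply("acct-1", date(2016, 11, 1), {"status": "open"})     # original ingest
accounts.apply("acct-1", date(2016, 11, 20), {"status": "closed"})  # change from source

print(accounts.as_of("acct-1", date(2016, 11, 10)))  # value before the change
print(accounts.as_of("acct-1", date(2016, 12, 1)))   # value after the change
```

Because nothing is overwritten, the changed record can always be compared with the version that was originally ingested.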

A data management platform should handle all of this. Moreover, it should ensure that all necessary data preparation is complete before the data is published in the data lake for consumption.

The Zaloni Data Platform allows you to automatically orchestrate and manage the data preparation process from simple to complex so that when your users are ready to analyze the data, the data is available. The platform also provides a robust RESTful API layer to allow integration with other systems for tracking and reporting.

For more on data lineage and transformations, read the self-service data presentation article.

About the Author

Adam Diaz

Big data & Hadoop thought-leader
