Zaloni Zip: Data Lineage

December 2, 2016 Adam Diaz

Maintaining data lineage in your data lake is not just a “nice to have” feature. Organizations in industries from finance to healthcare face government regulations that cover not only privacy but also data lineage. In fact, how data traverses the data lake, from ingestion through transformation, is often a legal matter.

So, how do you gain access to this information? Consider your data lake management tools. Regardless of the data lake solutions that you use or evaluate, look for the following data preparation capabilities:

  • Data tagging so that searching and sorting become easier
  • Converting data formats to make executing queries against the data faster
  • Executing complex workflows to integrate updated or changed data

Whenever you do any of these things, you need metadata that shows the lineage from a transformation perspective: What queries were run? When did they run? What files were generated? You need to create a lineage graph of all the transformations that happen to the data as it flows through the pipeline.
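As a rough illustration of the metadata described above, here is a minimal sketch in Python of a lineage graph that records what ran, when it ran, and which files it read and wrote. The `Transformation` and `LineageGraph` names, the file paths, and the query strings are all hypothetical and chosen for this example; they are not part of any particular platform's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Set


@dataclass
class Transformation:
    """One pipeline step: what ran, when, and which files it touched."""
    query: str
    ran_at: datetime
    inputs: List[str]
    outputs: List[str]


@dataclass
class LineageGraph:
    """Hypothetical in-memory lineage store (an assumption, not a product API)."""
    steps: List[Transformation] = field(default_factory=list)

    def record(self, query: str, inputs: List[str], outputs: List[str],
               ran_at: Optional[datetime] = None) -> Transformation:
        # Default the timestamp to "now" in UTC when the caller gives none.
        step = Transformation(query, ran_at or datetime.now(timezone.utc),
                              list(inputs), list(outputs))
        self.steps.append(step)
        return step

    def upstream(self, path: str) -> Set[str]:
        """Return every file that `path` was derived from, directly or not."""
        producers = {out: s for s in self.steps for out in s.outputs}
        seen: Set[str] = set()
        stack = [path]
        while stack:
            step = producers.get(stack.pop())
            if step is None:
                continue  # a source file: nothing produced it
            for src in step.inputs:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


graph = LineageGraph()
graph.record("tag: source=crm", ["raw/customers.csv"], ["tagged/customers.csv"])
graph.record("convert csv to parquet", ["tagged/customers.csv"],
             ["curated/customers.parquet"])
ancestors = graph.upstream("curated/customers.parquet")
```

Walking the graph backwards from a published file answers the audit question "where did this come from?", while the stored `query` and `ran_at` fields answer "what ran, and when?".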

Other issues may also arise: the data coming from source systems may change. How do you reconcile that changed data with the original data sets you brought in? You should be able to maintain a time series of what happens to the data over time.
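One simple way to picture such a time series is to keep a dated snapshot per load and compare consecutive snapshots. The sketch below assumes each snapshot is a dictionary keyed by record id; the `diff_snapshots` helper and the sample records are hypothetical and only illustrate the idea.

```python
from datetime import date


def diff_snapshots(old, new):
    """Compare two snapshots keyed by record id and report what differs."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical time series: one snapshot of the source data per load date.
history = {
    date(2016, 11, 1): {
        1: {"name": "Ann", "city": "Austin"},
        2: {"name": "Bob", "city": "Boston"},
    },
    date(2016, 12, 1): {
        1: {"name": "Ann", "city": "Dallas"},
        3: {"name": "Cy", "city": "Chicago"},
    },
}

dates = sorted(history)
report = diff_snapshots(history[dates[0]], history[dates[1]])
print(report)
```

Because every snapshot is retained, any two points in time can be compared, not just the latest load against the previous one.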

Again, a data management platform can do all of this. Moreover, it can ensure that all necessary data preparation is complete before the data is published in the data lake for consumption.

The Zaloni Data Platform allows you to automatically orchestrate and manage the data preparation process, from simple to complex, so that the data is available when your users are ready to analyze it. The platform also provides a robust RESTful API layer that allows integration with other systems for tracking and reporting.

For more on data lineage and transformations, read the self-service data presentation article.

About the Author

Adam Diaz

Big data & Hadoop thought-leader
