Automated Data Inventory

November 1, 2018 Carlos Guerra

There are numerous methods of ingesting files into a data lake and a plethora of point solutions that support them. Two very common methods prevail, but both of them have inherent problems:

Method 1. Using tools such as Pig scripts to ingest data into HDFS.

These scripts usually don't register the files (or any associated tables) with any existing catalog system. Most data catalog or data inventory solutions only catalog what they ingested themselves, so the management system has no way of knowing about data that arrived through an external ingestion path such as a Pig script. The Zaloni platform can scan these external directories for input files and catalog them as entities.
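As a rough illustration of the idea (not the Zaloni implementation), the sketch below walks a directory tree and reports files that have no catalog entry, then registers them. It assumes a local filesystem as a stand-in for HDFS and a plain dictionary as the catalog; the function names and the "external-scan" label are hypothetical.

```python
import os
from typing import Dict, Iterable, List


def find_uncataloged_files(root: str, catalog: Dict[str, dict]) -> List[str]:
    """Walk `root` and return the paths that have no entry in `catalog`."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in catalog:
                missing.append(path)
    return sorted(missing)


def register_files(paths: Iterable[str], catalog: Dict[str, dict]) -> None:
    """Add a minimal catalog entry (size and origin) for each path."""
    for path in paths:
        catalog[path] = {
            "size_bytes": os.path.getsize(path),
            # Hypothetical label marking files ingested outside the catalog.
            "source": "external-scan",
        }
```

Running `find_uncataloged_files` again after `register_files` should return an empty list, which is a simple way to confirm that a scan has caught up with an external ingestion job.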

Method 2. Developing and testing directly in Hive before working with the data.

In this case, you create Hive tables and then create pipelines that transform those tables. Teams usually do this before ingesting any data, to confirm that the transformations behave as expected before running data through them. Even if teams use tools outside of the Zaloni platform to do this, the platform can scan a Hive database, go through every table in it (both internal and external), and create new entities or update existing ones based on the changes it finds. It can do this across thousands of tables as needed.
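The create-or-update step can be pictured with a small sketch. It is only an illustration under stated assumptions: the scanned Hive metadata is represented as a dictionary of table name to column list (as one might assemble from the metastore), the catalog is another dictionary, and `sync_entities` is a hypothetical name rather than a Zaloni API.

```python
from typing import Dict, List, Tuple

# A schema is a list of (column name, column type) pairs.
Schema = List[Tuple[str, str]]


def sync_entities(
    hive_tables: Dict[str, Schema], entities: Dict[str, Schema]
) -> Dict[str, List[str]]:
    """Create or update catalog entities so they match the scanned tables."""
    report: Dict[str, List[str]] = {"created": [], "updated": [], "unchanged": []}
    for table, schema in sorted(hive_tables.items()):
        if table not in entities:
            # Table was created outside the catalog: register it.
            entities[table] = list(schema)
            report["created"].append(table)
        elif entities[table] != list(schema):
            # Table is known but its schema changed: refresh the entity.
            entities[table] = list(schema)
            report["updated"].append(table)
        else:
            report["unchanged"].append(table)
    return report


if __name__ == "__main__":
    scanned = {
        "sales.orders": [("id", "bigint"), ("total", "double")],
        "sales.customers": [("id", "bigint"), ("name", "string")],
    }
    catalog = {"sales.orders": [("id", "bigint")]}
    print(sync_entities(scanned, catalog))
```

The returned report separates new tables from changed ones, which is the kind of information an operations team would need in order to notice changes made outside the catalog.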

In both scenarios, the real value of the Zaloni Data Platform (ZDP) shows in two ways:

  1. You don’t have to start from nothing. The platform catalogs everything you have and immediately gets you into production.
  2. If you have teams making changes outside of our tool, the platform can still register those changes so the operations teams are aware.

Further, the automated data inventory in the Zaloni Data Platform can discover and operationalize the existing tables in your data silos. It scans input files at massive scale across thousands of directories to catalog all of the files, associating them with existing entities or creating new entities if they don't yet exist in the Zaloni platform. It can register files and tables even if Zaloni tools weren't originally used to ingest them.
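The "associate or create" behaviour can be sketched in a few lines. This is a simplified illustration, not how the platform decides which entity a file belongs to: it assumes one entity per parent directory, and the function name `associate_files` and the dictionary-based entity store are hypothetical.

```python
import posixpath
from typing import Dict, Iterable, List


def associate_files(paths: Iterable[str], entities: Dict[str, List[str]]) -> List[str]:
    """Attach each file to the entity keyed by its parent directory.

    Returns the keys of entities that had to be created.
    """
    created = []
    for path in sorted(paths):
        key = posixpath.dirname(path)  # assumption: one entity per directory
        if key not in entities:
            entities[key] = []
            created.append(key)
        if path not in entities[key]:
            # Skip files that are already associated, so rescans are idempotent.
            entities[key].append(path)
    return created
```

Because already-associated files are skipped, the same scan can be repeated on a schedule without duplicating entries.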

In terms of self-service access to the business data, the platform can automatically sync with these tables. This means that no matter what tools developers and analysts are using on the data, you still have full access to the cataloging and operational information that’s critical for your business.

Watch this demo to see the Zaloni Data Platform in action and contact us to see how the ZDP can help your business gain full access to critical insights.

About the Author

Software Architect
