There are numerous methods of ingesting files into a data lake and a plethora of point solutions that support them. Two very common methods prevail, but both of them have inherent problems:
Method 1. Using custom scripts, such as Pig scripts, to ingest data into HDFS.
These scripts usually don't register the files (or any associated tables) with any existing catalog system. When building a data catalog or data inventory, most solutions catalog only what they directly ingested. Therefore, if files arrive through external ingestion (such as a Pig script), the management system has no way to be aware of them. The Zaloni platform can scan these external directories for input files and catalog them as entities.
Method 2. Developing and testing directly in Hive before operationalizing the data.
In this case, you create Hive tables and then build pipelines that transform those tables. Most of the time, teams do this before ever ingesting production data, to confirm the transformations work as expected before running data through them. Even if teams use tools outside of the Zaloni platform for this step, the platform can scan a Hive database, go through every table in it (both internal and external), and create new entities or update existing ones based on the changes. Again, it can do this across thousands of tables as needed.
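The reconciliation step described above can be sketched as follows. This is a hedged illustration only: `list_tables` and `describe_table` stand in for real metastore calls (e.g. `SHOW TABLES` / `DESCRIBE` issued through a Hive client), and neither the function names nor the catalog shape reflect an actual ZDP or Hive API.

```python
def sync_hive_tables(list_tables, describe_table, catalog: dict) -> dict:
    """Reconcile a catalog with the tables in a Hive database.

    `list_tables()` returns table names; `describe_table(name)` returns
    a schema as [(column, type), ...]. Both callables are assumptions
    standing in for metastore access. Known tables get their schema
    updated in place (other metadata is preserved); unknown tables
    become new catalog entities.
    """
    for name in list_tables():
        schema = describe_table(name)
        if name in catalog:
            catalog[name]["schema"] = schema      # update existing entity
        else:
            catalog[name] = {"schema": schema}    # create new entity
    return catalog
```

Because the sync only touches the schema field of an existing entity, operational metadata attached by other teams survives the refresh.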
In both scenarios, the real value of the Zaloni Data Platform (ZDP) shows in two ways:
- You don’t have to start from nothing. The platform catalogs everything you have and immediately gets you into production.
- If you have teams making changes outside of our tool, the platform can still register those changes so the operations teams are aware.
Further, the automated data inventory in the Zaloni Data Platform can discover and operationalize the existing tables in your data silos. It scans input files at massive scale across thousands of directories, cataloging every file and either associating it with an existing entity or creating a new one if none exists in the Zaloni platform. It can register files and tables even if Zaloni tools weren't originally used to ingest them.
In terms of self-service access to the business data, the platform can automatically sync with these tables. This means that no matter what tools developers and analysts are using on the data, you still have full access to the cataloging and operational information that’s critical for your business.
About the Author: Carlos Guerra