Is Metadata Management Really That Hard in Hadoop?

March 11, 2015

We previously discussed that metadata is the keystone of your data lake. To get a complete view of your data lake, you need to capture technical, business, and operational metadata, and then be able to search that information. If ignored, your Hadoop data lake could become a very large collection of lost or unusable data. Although it is highly recommended to capture metadata while defining your data pipeline, it is possible to capture this critical information once the data is already in the data lake.

In the Hadoop ecosystem, metadata is stored in HCatalog, which is part of Hive. However, that covers only one dimension: the technical aspect. Creating and maintaining a core metadata framework yourself may not be the best investment of your development dollars. Instead, those dollars might be better spent building applications for your specific business problems.

How Zaloni Handles Metadata

Zaloni's data lake management platform handles metadata through the concept of Entities. Entities are analogous to tables in a relational database and contain fields. While defining an Entity, you can specify additional attributes that help with subsequent discovery of data and correlation with incoming files. These include:

  • Business names and descriptions at both the entity and field levels
  • Primary keys
  • Data quality rules at both the entity and field levels
  • Change data capture indicating how to determine the latest record for a given primary key
  • HCatalog integration, which publishes the schema to the Hive metastore (used by HCatalog), optionally indicating where the data is stored
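To make the list above concrete, here is a minimal sketch of what an Entity definition might look like. The structure and field names are purely illustrative assumptions, not the platform's actual schema:

```python
# Hypothetical, simplified representation of an Entity definition,
# illustrating the kinds of attributes listed above. The structure
# is an assumption for illustration, not the platform's real format.
customer_entity = {
    "name": "customer",
    "business_name": "Customer Master",          # business-level name
    "primary_key": ["customer_id"],
    "fields": [
        {"name": "customer_id", "type": "string",
         "business_name": "Customer ID"},
        {"name": "signup_date", "type": "string",
         "quality_rule": "date_format:YYYY-MM-DD"},
    ],
    "cdc_strategy": "latest_by:updated_at",      # how to pick the newest record per key
    "hcatalog": {"publish": True, "location": "/data/raw/customer"},
}

def quality_rule(entity, field_name):
    """Look up the quality rule attached to a field, if any."""
    for field in entity["fields"]:
        if field["name"] == field_name:
            return field.get("quality_rule")
    return None

print(quality_rule(customer_entity, "signup_date"))  # date_format:YYYY-MM-DD
```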

The platform captures operational metadata automatically as part of the ingestion. For example, given an Entity you can see all the files that have recently been ingested, where they are stored, and how big the files are. In this way, we help to organize the data as it's coming into HDFS.


Schema Evolution

Through our experience with production deployments of Hadoop, we have learned that although you may define the metadata at the beginning of the project, it will inevitably evolve: columns get added, or the structure changes.

In our platform, when the structure of an Entity changes, you can create a new version. Files previously ingested remain associated with earlier versions, while new data coming in is associated with the new version of the Entity. There is no need to dig through history to determine what a file's structure was in the past.
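The idea can be sketched in a few lines: each ingested file records the Entity version that was current at ingestion time, so historical files can always be interpreted with the schema they arrived under. This models the concept only and is not the platform's implementation:

```python
# Each Entity version carries its own column list; files remember which
# version they were ingested under. Illustrative sketch only.
entity_versions = {
    1: ["id", "name"],
    2: ["id", "name", "email"],   # a column added later
}

ingested_files = [
    {"path": "/data/customer/2015-01-01.csv", "entity_version": 1},
    {"path": "/data/customer/2015-03-01.csv", "entity_version": 2},
]

def schema_for(file_record):
    """Return the column list a given file was ingested under."""
    return entity_versions[file_record["entity_version"]]

print(schema_for(ingested_files[0]))  # ['id', 'name']
print(schema_for(ingested_files[1]))  # ['id', 'name', 'email']
```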


Data Quality

The platform also allows you to associate rules, or rule sets, with fields within Entities. This gives you visibility into how "clean" the data is. For example, you may have a rule that the date field must be in the standard format YYYY-MM-DD. After the data is ingested, you can run the data quality rules to analyze which records do not meet the standard format.
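The date-format rule mentioned above can be sketched as a simple validation pass. This stands in for the platform's rule engine; the regex and record shapes are illustrative assumptions:

```python
import re

# Flag records whose date field is not in YYYY-MM-DD form. A hedged
# sketch of the rule's logic, not the platform's rule engine.
DATE_RULE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

records = [
    {"id": 1, "date": "2015-03-11"},
    {"id": 2, "date": "03/11/2015"},   # wrong format, fails the rule
]

failures = [r["id"] for r in records if not DATE_RULE.match(r["date"])]
print(failures)  # [2]
```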

Even at the Entity level, you can define rules to ensure that the incoming file meets a given constraint, such as the number of records, or schema correctness. 


Automate Metadata Capture for Enterprise Visibility in the Data Lake

One of our customers (probably the most advanced big data deployment in our experience) has taken metadata management to the next level. They have automated schema evolution so that whenever there is a metadata change, a new revision of the Entity is created automatically.

The platform provides a REST API that allowed them to automate this functionality. The API also lets you create, update, and delete Entities. We understand that a data lake rarely stands alone; it is often part of a larger ecosystem with complex upstream and downstream systems, especially in large-scale production environments.
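As a rough sketch of that kind of automation, the snippet below assembles a request that would register a new Entity over HTTP. The endpoint path, host, and payload shape are assumptions for illustration; the platform's API documentation defines the real interface:

```python
import json

# Hypothetical base URL; the real API endpoint will differ.
BASE_URL = "https://datalake.example.com/api"

def build_create_entity_request(definition):
    """Assemble the pieces of a POST that would register a new Entity.

    Returning the request parts (rather than sending them) keeps the
    sketch self-contained; a real client would pass these to an HTTP
    library such as urllib or requests.
    """
    return {
        "method": "POST",
        "url": f"{BASE_URL}/entities",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(definition),
    }

req = build_create_entity_request({"name": "customer", "fields": ["id", "name"]})
print(req["method"], req["url"])  # POST https://datalake.example.com/api/entities
```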


Looking to the Future

Two interesting areas are coming in the next platform version.

One of them is Avro support, which will make schema evolution a breeze. Avro also has the added benefit of being fast and compact.
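The reason Avro eases schema evolution is its schema-resolution rule: a new field added with a default value leaves old records readable, because the reader fills in the default. The sketch below models that rule in plain Python; real code would use an Avro library:

```python
# Simplified model of Avro schema resolution: old records written under
# v1 remain readable under v2 because the new field carries a default.
writer_schema_v1 = {"fields": [{"name": "id", "type": "string"}]}
reader_schema_v2 = {"fields": [
    {"name": "id", "type": "string"},
    {"name": "email", "type": "string", "default": ""},  # added later
]}

def resolve(record, reader_schema):
    """Fill fields missing from a record with the reader schema's defaults."""
    return {f["name"]: record.get(f["name"], f.get("default"))
            for f in reader_schema["fields"]}

old_record = {"id": "42"}                     # written under v1
print(resolve(old_record, reader_schema_v2))  # {'id': '42', 'email': ''}
```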

The other area of interest is automated creation of Entities by simply pointing to a source. The first source we will support is Oracle GoldenGate. This will:

  • Create platform entities to match the structure of the source table
  • Create Hive metastore entries (used by HCatalog) for those platform entities, so the data can be queried with Hive, Pig or other tools leveraging HCatalog
  • Write data directly to HDFS, and update the file ingestion history
  • Associate incoming data with post-ingestion processing, such as change data capture

In this way, Zaloni provides an integrated solution for metadata management that ties seamlessly into the ingestion and transformation phases.


For more on metadata management, get the latest eBook, "Understanding Metadata".
