We previously discussed that metadata is the keystone of your data lake. To get a complete view of your data lake, you need to capture technical, business, and operational metadata, and then be able to search this information. If ignored, your Hadoop data lake could become a very large collection of lost or unusable data. Although it is highly recommended to capture metadata while defining your data pipeline, it is possible to capture this critical information once the data is already in the data lake.
In the Hadoop ecosystem, HCatalog, which is part of Hive, is where the metadata is stored. However, this covers only one dimension: the technical aspect. Creating and maintaining the core metadata framework yourself may not be the best investment of your development dollars. Instead, those dollars might be better spent building applications for your specific business problems.
How Bedrock Handles Metadata
Bedrock handles metadata through the concept of Entities. Entities are analogous to tables in a relational database and contain fields. While defining an Entity, you can specify additional attributes, which help subsequent discovery of data and correlation with incoming files. These include:
- Business names and descriptions at both the entity and field levels
- Primary keys
- Data quality rules at both the entity and field levels
- Change data capture indicating how to determine the latest record for a given primary key
- HCatalog integration, which publishes the schema to the Hive metastore (used by HCatalog), optionally indicating where the data is stored
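The change data capture attribute above amounts to resolving, for each primary key, which record is the most recent. As a minimal sketch (assuming each record carries the primary key and a sortable timestamp column; the field names here are illustrative, not Bedrock's):

```python
def latest_records(records, key_field, ts_field):
    """Keep only the most recent record per primary key -- a simple
    change-data-capture resolution, assuming each record carries a
    sortable timestamp column."""
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[ts_field] > latest[key][ts_field]:
            latest[key] = rec
    return list(latest.values())

rows = [
    {"id": 1, "city": "Raleigh", "updated": "2015-01-02"},
    {"id": 1, "city": "Durham",  "updated": "2015-03-01"},  # newer for id 1
    {"id": 2, "city": "Boston",  "updated": "2015-02-15"},
]
print(latest_records(rows, "id", "updated"))
```

ISO-formatted dates sort correctly as strings, which is why the simple `>` comparison works here.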
Bedrock captures operational metadata automatically as part of the ingestion. For example, given an Entity you can see all the files that have recently been ingested, where they are stored, and how big the files are. In this way, Bedrock helps to organize the data as it's coming into HDFS.
Through our experience with production deployments of Hadoop, we have learned that although you may define the metadata in the beginning of the project, it is inevitable that this will evolve. Columns get added, or the structure is changed.
In Bedrock, when the structure of an Entity changes, you can create a new version. Files previously ingested remain associated with the earlier versions, while new data coming in is associated with the new version of the Entity. There is no need to dig through history to determine what the structure of a file was in the past.
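The version-aware bookkeeping can be pictured as follows. This is a hypothetical sketch, not Bedrock's internal model: each ingested file is tagged with the Entity version that was current at ingestion time, so historical files keep their original structure.

```python
# Hypothetical sketch: each ingested file records the entity version
# that was current when it arrived, so old files keep their old schema.
entity_versions = {
    1: ["id", "name"],
    2: ["id", "name", "email"],   # a column was added later
}

ingestions = [
    {"file": "customers_2015_01.csv", "entity_version": 1},
    {"file": "customers_2015_06.csv", "entity_version": 2},
]

def schema_for(file_record):
    """Look up the structure a file had at the time it was ingested."""
    return entity_versions[file_record["entity_version"]]

print(schema_for(ingestions[0]))  # the older file keeps the older schema
```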
Bedrock allows you to associate rules, or rule sets, with fields within Entities. This gives you visibility into how "clean" the data is. For example, you may have a rule that a date field must be in the standard YYYY-MM-DD format. After the data is ingested, you can run the data quality rules to identify which records do not meet that format.
Even at the Entity level, you can define rules to ensure that the incoming file meets a given constraint, such as number of records, or schema correctness.
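To make the two levels concrete, here is a minimal sketch of a field-level format rule and an entity-level record-count rule. The rule shapes and names are illustrative, not Bedrock's rule engine:

```python
import re

DATE_RULE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # standard YYYY-MM-DD format

def failing_records(records, field, pattern):
    """Field-level rule: return the records whose value fails the pattern."""
    return [r for r in records if not pattern.match(str(r.get(field, "")))]

def entity_rule_passes(records, min_records):
    """Entity-level rule: the incoming file must contain enough records."""
    return len(records) >= min_records

rows = [
    {"id": 1, "order_date": "2015-04-01"},
    {"id": 2, "order_date": "04/02/2015"},   # violates the standard format
]

bad = failing_records(rows, "order_date", DATE_RULE)
print(bad)                                   # the one non-conforming record
print(entity_rule_passes(rows, min_records=1))
```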
One of our customers (probably the most advanced big data deployment in our experience) has taken metadata management to the next level. They have automated the schema evolution such that any time that there is a metadata change, they have an automated way of creating a new revision of the Entity.
Bedrock provides a REST API that allowed them to automate this functionality. The API also lets you create, update, and delete Entities. We understand that a data lake rarely stands alone; it is often part of a larger ecosystem with complex upstream and downstream systems, especially in large-scale production environments.
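An automated schema-evolution flow like the one above boils down to: detect a metadata change upstream, build a payload describing the new Entity revision, and POST it to the API. The sketch below only constructs such a payload; the endpoint path and payload fields are hypothetical, not Bedrock's actual API contract, so consult the product documentation for the real shapes.

```python
import json

def new_entity_version_payload(name, fields, version):
    """Build a JSON payload describing a new revision of an entity.
    The field names and structure here are illustrative only."""
    return json.dumps({
        "name": name,
        "version": version,
        "fields": [{"name": f, "type": "string"} for f in fields],
    })

payload = new_entity_version_payload("customer", ["id", "name", "email"], 2)

# An upstream schema-change event would then POST this payload, e.g.
# (hypothetical endpoint):
#   requests.post("https://bedrock.example.com/api/entities", data=payload)
print(payload)
```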
Looking to the Future
There are two interesting areas coming in the next version of Bedrock.
One of them is Avro support, which will make schema evolution a breeze. It also has the added benefit of being fast and compact.
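A toy illustration of why Avro eases schema evolution: a reader schema can add a field with a default value, and records written before the change are resolved against the new schema without rewriting the data. Real Avro does this through writer/reader schema resolution; this pure-Python sketch only mimics the default-filling behavior.

```python
# Reader schema in the style of an Avro record schema (simplified):
# a new 'email' field was added with a default, so old records still resolve.
reader_schema = {
    "fields": [
        {"name": "id"},
        {"name": "name"},
        {"name": "email", "default": None},  # added in the new version
    ]
}

def resolve(record, schema):
    """Fill missing fields from the schema's defaults, as Avro's
    schema resolution would for a record written under an older schema."""
    return {
        f["name"]: record.get(f["name"], f.get("default"))
        for f in schema["fields"]
    }

old_record = {"id": 1, "name": "Ada"}   # written before 'email' existed
print(resolve(old_record, reader_schema))
```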
The other area of interest is automated creation of Entities by simply pointing to a source. The first source we will support is Oracle GoldenGate. This will:
- Create Bedrock entities to match the structure of the source table
- Create Hive metastore entries (used by HCatalog) for those Bedrock entities, so the data can be queried with Hive, Pig or other tools leveraging HCatalog
- Write data directly to HDFS, and update Bedrock's file ingestion history
- Associate incoming data with post-ingestion processing, such as change data capture
In this way, Bedrock provides an integrated solution for metadata management that ties seamlessly into the ingestion and transformation phases.