The keystone to your data lake: metadata
You know that stone at the top of an arch – the keystone – the one on which all the other parts of the arch depend? Well, in a Hadoop data lake, I like to think of metadata as the keystone. It’s the essential (and often overlooked) component that enables you to derive value from your Hadoop investment. Metadata is what allows you to know what data you have and to access it when needed. Without this linchpin, be prepared for your data lake project to fall apart.
Meet your metadata
Metadata is more than just the schema of a file – it’s anything and everything about the data. I like to loosely categorize metadata into three areas: technical, operational and business. When you use all three categories to qualify your data, you get a pretty complete view of the inventory in your Hadoop data lake. Let’s take a look at the three different types:
- Technical metadata refers to the form and structure of the data. It’s analogous to the schema or table definition in a database, but goes beyond it. In addition to data types, technical metadata captures the data quality rules that describe what data is valid, as well as the sensitivity of the data (e.g., whether it contains personally identifiable information). Technical metadata may also specify what the primary key of a dataset is or which fields can be used to identify incremental updates to a dataset.
- Operational metadata captures useful information like when the data was ingested, where it came from and who ingested it. Operational metadata can be used to show data lineage and how data moves from initial raw form into new prepared or transformed datasets that are intended for specific analytics or reporting. Operational metadata is also useful for analysis and reporting of Hadoop data lake usage, allowing a Hadoop data lake administrator to track how much data is being ingested or created by each user.
- Business metadata gives us the labels, tags and names that make data much easier to browse or search. When provided by a data steward or data architect, this metadata becomes extremely important as part of a rich data catalog for users wanting to explore the data in the lake. These data consumers need to understand the business-level attributes of the data to help them decide whether it is the right dataset for their use.
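To make the three categories concrete, here’s a minimal sketch of what a metadata record for a single dataset might look like. Every field name and value below is illustrative – an assumption for the sake of the example, not the format of any particular metadata tool.

```python
# Illustrative sketch of one dataset's metadata record in a data lake.
# All field names and values are hypothetical examples.

dataset_metadata = {
    "technical": {
        "schema": [
            {"name": "customer_id", "type": "string", "nullable": False},
            {"name": "order_total", "type": "decimal(10,2)", "nullable": True},
        ],
        "primary_key": ["customer_id"],
        "incremental_field": "last_updated",      # field used to find incremental updates
        "quality_rules": ["order_total >= 0"],    # rules describing what data is valid
        "sensitivity": {"customer_id": "PII"},    # sensitivity labels per field
    },
    "operational": {
        "ingested_at": "2017-03-01T08:30:00Z",    # when the data was ingested
        "source": "orders_db.orders",             # where it came from
        "ingested_by": "etl_service",             # who ingested it
        "lineage": ["raw/orders", "prepared/orders_clean"],  # raw -> transformed
    },
    "business": {
        "title": "Customer orders",
        "tags": ["sales", "orders"],
        "steward": "data_governance_team",
        "description": "One row per customer order, updated daily.",
    },
}

def find_pii_fields(metadata):
    """Return the names of fields flagged as PII in the technical metadata."""
    sensitivity = metadata["technical"]["sensitivity"]
    return [field for field, label in sensitivity.items() if label == "PII"]
```

With a record like this, simple questions become simple queries – for instance, `find_pii_fields(dataset_metadata)` returns `["customer_id"]`, the kind of lookup a catalog or governance tool performs across every dataset in the lake.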
Without metadata, your data lake is lost at sea
Capturing and managing metadata across all three categories gives you a rich set of information that ties together the entire managed data pipeline, from raw to analytics-ready. You also create the foundation for a user-friendly data inventory, as well as catalog capabilities. Without metadata, and without a platform to collect and manage it, it’s all too easy for a Hadoop data lake to become a very large collection of lost or unusable data.
Lastly, the key to the keystone is: start early. Make sure metadata governance is part of your Hadoop implementation plan from the beginning. While reverse engineering or discovering metadata after the fact can be done, it’s no fun and can result in gaps in the metadata.
Are you collecting and managing metadata in your Hadoop data lake today? If you have questions about whether you’re doing it to your best advantage, please reach out. We’d love to chat about ways you could potentially improve your Hadoop metadata management strategy.