Excerpt from ebook, Understanding Metadata: Create the Foundation for a Scalable Data Architecture, by Federico Castanedo and Scott Gidley.
Modern data architectures promise the ability to enable access to more and different types of data to an increasing number of data consumers within an organization. Without proper governance, enabled by a strong foundation of metadata, these architectures often show initial promise, but ultimately fail to deliver.
Let’s take logistics distribution as an analogy to explain metadata and why it’s critical in managing the data in today’s business environment. When you ship a package to an international destination, you want to know where along the route the package is located in case something happens with the delivery. Logistics companies keep manifests to track the movement of packages and their successful delivery along the shipping process.
Metadata provides this same type of visibility into today’s data-rich environment. Data is moving into, out of, and within companies. Tracking data changes and detecting any process that causes problems when you are doing data analysis is hard if you don’t have information about the data and the data movement process. Today, even the change of a single column in a source table can impact hundreds of reports that use that data—making it extremely important to know beforehand which columns will be affected.
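To make the column-change scenario concrete, here is a minimal sketch of impact analysis over lineage metadata. It assumes lineage is available as a simple mapping from each column or table to the items that read from it; the `LINEAGE` contents and the `downstream_of` helper are hypothetical names invented for this illustration, not part of any particular catalog product.

```python
from collections import deque

# Hypothetical lineage map: each key feeds the items listed as its value.
# All table, column, and report names are made up for illustration.
LINEAGE = {
    "orders.customer_id": ["stg_orders.customer_id"],
    "stg_orders.customer_id": ["report_churn", "report_revenue_by_customer"],
    "orders.amount": ["stg_orders.amount"],
    "stg_orders.amount": ["report_revenue_by_customer"],
}


def downstream_of(node, lineage):
    """Return every item reachable from `node`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


# Which downstream columns and reports would a change to this column touch?
affected = downstream_of("orders.customer_id", LINEAGE)
print(affected)
```

With lineage recorded this way, the question "which reports will this column change affect?" becomes a graph traversal that can run before the change is made.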
What is Metadata?
Metadata is data about each dataset, such as its size, schema, format, last-modified time, access control lists, and usage. The use of metadata enables the management of a scalable data lake platform and architecture, as well as data governance. Metadata is commonly stored in a central catalog to provide users with information on the available datasets.
Metadata can be classified into three groups:
Technical metadata captures the form and structure of each dataset, such as the size and structure of the schema or the type of data (text, images, JSON, Avro, etc.). The structure of the schema includes the names of fields, their data types, their lengths, whether they can be empty, and so on. Structure is commonly provided by a relational database or the heading in a spreadsheet, but may also be added during ingestion and data preparation. Some basic technical metadata can be obtained directly from the datasets (e.g., size), while other technical metadata must be derived.
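As an illustration of structure being provided by a relational database, the sketch below reads field names, data types, and nullability straight from a table definition. It uses Python's built-in `sqlite3` module and SQLite's `PRAGMA table_info`; the `describe_table` helper and the `customers` table are assumptions made up for this example, and the table name is assumed to come from a trusted source because it is interpolated into the SQL text.

```python
import sqlite3


def describe_table(conn, table):
    """Collect basic technical metadata for one SQLite table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = [
        {"name": name, "type": col_type, "nullable": not notnull}
        for _cid, name, col_type, notnull, _default, _pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        )
    ]
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return {"table": table, "row_count": row_count, "columns": columns}


# Hypothetical in-memory table used only to demonstrate the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER NOT NULL, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
print(describe_table(conn, "customers"))
```

The row count here is a derived value, computed by scanning the data, whereas the column definitions are read directly from what the database already stores about the table.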
Operational metadata captures the lineage, quality, profile, and provenance of the data: when the data elements arrived, where they are located, where they came from, and what their quality is. It may also record how many records were rejected during data preparation or a job run, and the success or failure of that run itself. Operational metadata also identifies how often the data may be updated or refreshed.
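A minimal sketch of how a preparation job might capture this kind of information follows. The `RunRecord` fields, the `run_preparation` function, and the rule that a run fails when more than half of the rows are rejected are all assumptions chosen for illustration; a real pipeline would define its own fields and thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RunRecord:
    """Hypothetical operational metadata for one preparation run."""
    source: str
    started_at: datetime
    finished_at: datetime
    accepted: int
    rejected: int
    succeeded: bool


def run_preparation(source, rows, is_valid, max_reject_ratio=0.5):
    """Validate rows and record what happened during the run."""
    started = datetime.now(timezone.utc)
    accepted = [row for row in rows if is_valid(row)]
    rejected = len(rows) - len(accepted)
    # Assumed rule: the run fails when too many rows are rejected.
    succeeded = not rows or rejected / len(rows) <= max_reject_ratio
    record = RunRecord(source, started, datetime.now(timezone.utc),
                       len(accepted), rejected, succeeded)
    return accepted, record


# Hypothetical input: one valid row and one row with a missing email.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
clean, record = run_preparation("crm_export", rows,
                                lambda r: r["email"] is not None)
print(record.accepted, record.rejected, record.succeeded)
```

Storing one such record per run gives the catalog the arrival time, source, rejection count, and run outcome described above, and the sequence of records over time shows how often the data is refreshed.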
Finally, business metadata captures what the data means to the end user and makes data fields easier to find and understand, for example, through business names, descriptions, tags, quality, and masking rules. These tie into the business attribute definitions so that everyone interprets the same data consistently, according to a set of rules and concepts defined by the business users. A business glossary is a central location that provides a business description for each data element.
Metadata can be obtained in different ways. Sometimes it is encoded within the datasets, other times it can be inferred by reading the content of the datasets, or it can be spread across log files written by the processes that access these datasets.
In all cases, metadata is a key element in the management of the data lake, and is the foundation that allows for the following data lake characteristics and capabilities to be achieved:
- Data visibility
- Data reliability
- Data profiling
- Data lifecycle/age
- Data security and privacy
- Democratized access to useful data
- Data lineage and change data capture
Data lakes must be architected properly to leverage metadata and integrate with existing metadata tools. Otherwise, they create a hole in an organization’s data governance process, because information about how data is used, transformed, and related outside the data lake can be lost. An incorrect metadata architecture can often prevent data lakes from making the transition from an analytical sandbox to an enterprise data platform.
Ultimately, most of the time spent in data analysis goes into preparing and cleaning the data. Metadata helps to reduce the time to insight by making it easy to discover what data is available and by maintaining a full data tracking map (data lineage).