Excerpt from ebook, Understanding Metadata: Create the Foundation for a Scalable Data Architecture, by Federico Castanedo and Scott Gidley.
Metadata generation can be an exhausting process if it is performed by manually inspecting each data source. The process is even harder in larger companies with numerous and disparate data sources. The key is to automate the capture of metadata when data arrives in the lake, and to identify relationships with existing metadata definitions, governance policies, and business glossaries.
Sometimes metadata information is not provided in a machine-readable form, so metadata must be entered manually by the data curator, or discovered by a specific product. To be successful with a modern data architecture, it’s critical to have a way to automatically register or discover metadata, and this can be done by using a metadata management or generation platform.
Since the data lake is a cornerstone of the modern data architecture, whatever metadata is captured in the data lake also needs to be fed into the enterprise metadata repository, so that you have an end-to-end view across all the data assets in the organization, including but not limited to the data lake. An idea of what automated metadata registration could look like is shown in Figure 1-2.
Figure 1-2 shows an API that runs on a Hadoop cluster, which retrieves metadata such as origin, basic information, and timestamp and stores it in an operational metadata file. New metadata is also stored in the enterprise metadata repositories, so it will be available for different processes.
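As a rough illustration of the capture step described above, the following Python sketch builds a record with origin, basic information, and a timestamp, then appends it to an operational metadata file. The function names, the field names, and the JSON Lines file format are assumptions made for this example; they are not taken from the ebook or from any particular product, and forwarding the record to an enterprise repository is left out.

```python
import json
import os
from datetime import datetime, timezone


def capture_metadata(path, origin):
    """Build an operational metadata record for one newly arrived file.

    Field names are hypothetical; adapt them to your own metadata model.
    """
    stat = os.stat(path)
    return {
        "origin": origin,                    # source system, supplied by the caller
        "name": os.path.basename(path),      # basic information about the file
        "size_bytes": stat.st_size,
        "registered_at": datetime.now(timezone.utc).isoformat(),  # capture time (UTC)
    }


def register(record, metadata_file):
    """Append the record as one line to a JSON Lines operational metadata file."""
    with open(metadata_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

A call such as `register(capture_metadata("orders.csv", origin="crm_export"), "operational_metadata.jsonl")` would add one line per arriving file, which a separate process could then push to the enterprise metadata repository.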
Another related step that is commonly applied in the automation phase is the encryption of personal information and the use of tokenization algorithms.
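One simple way to picture tokenization is a keyed hash that replaces each sensitive value with a stable token. The sketch below uses HMAC-SHA256 from the Python standard library. This is only one possible approach, chosen for the example: it is deterministic and not reversible, so it differs from vault-based tokenization and from encryption. The function names and the choice of sensitive fields are hypothetical.

```python
import hashlib
import hmac


def tokenize(value, key):
    """Return a deterministic, non-reversible token for a sensitive value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize_record(record, sensitive_fields, key):
    """Return a copy of the record with the sensitive fields replaced by tokens."""
    return {
        field: tokenize(str(value), key) if field in sensitive_fields else value
        for field, value in record.items()
    }
```

Because the same value and key always give the same token, tokenized columns can still be joined or counted, while the original value is not stored in the lake. The key itself would need to be kept outside the lake.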
Ensuring data quality is also a relevant point to consider in any data lake strategy. How do you ensure the quality of the data transparently to the users?
One option is to profile the data in the ingestion phase and perform a statistical analysis that produces a quality report stored as metadata. The quality check can be performed at the level of each dataset, and the results can be presented in a dashboard that reads the corresponding metadata.
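To make the profiling idea concrete, here is a minimal Python sketch that computes a few per-column statistics for a dataset held as a list of dictionaries. The chosen statistics (row count, null ratio, distinct values) and the function name are assumptions for illustration; a real profiler would typically compute many more measures and work on larger data.

```python
def profile(rows):
    """Compute simple per-column quality statistics for a list of dict rows.

    Assumes scalar (hashable) values. A column missing from a row is not
    counted for that row.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(name, {"count": 0, "nulls": 0, "values": set()})
            stats["count"] += 1
            if value is None or value == "":
                stats["nulls"] += 1          # treat None and empty string as missing
            else:
                stats["values"].add(value)
    return {
        name: {
            "count": stats["count"],
            "null_ratio": stats["nulls"] / stats["count"],
            "distinct": len(stats["values"]),
        }
        for name, stats in columns.items()
    }
```

The resulting dictionary could be stored alongside the dataset's other metadata, so that a dashboard only needs to read the report and never has to rescan the data.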
A relevant question in the automation of metadata is how to handle changes in the data schema. Current solutions are just beginning to scratch the surface of what can be done here. Today, when a change in the metadata occurs, it is necessary to reload the data. It would be very helpful to automate this process and introspect the data directly to detect schema changes in real time. Then, when the metadata changes, the modification could be detected and recorded by creating a new entity.
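The introspection idea can be sketched in Python as two small steps: infer a schema from the incoming records, then compare it with the schema already registered. Both function names and the representation of a schema as a mapping from column name to observed type names are assumptions made for this example, not an existing API.

```python
def infer_schema(rows):
    """Map each column name to the sorted type names observed in the rows."""
    observed = {}
    for row in rows:
        for name, value in row.items():
            observed.setdefault(name, set()).add(type(value).__name__)
    return {name: sorted(types) for name, types in observed.items()}


def diff_schema(registered, incoming):
    """Describe how an incoming schema differs from the registered one."""
    return {
        "added": sorted(set(incoming) - set(registered)),
        "removed": sorted(set(registered) - set(incoming)),
        "changed": sorted(
            name
            for name in set(registered) & set(incoming)
            if registered[name] != incoming[name]
        ),
    }
```

If any of the three lists is non-empty, the ingestion process could register the incoming schema as a new entity or version instead of requiring a full reload.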