Excerpt from report, Managing the Data Lake: Moving to Big Data Analysis, by Andy Oram, editor at O’Reilly Media
Why do you need to preserve metadata about your data? Reasons for doing so abound:
- For your analytics, you will want to choose data from the right place and time. For instance, you may want to go back to old data from all your stores in a particular region.
- Data preparation and cleaning require a firm knowledge of which data set you’re working on. Different sets require different types of preparation, based on what you have learned about them historically.
- Analytical methods are often experimental and have some degree of error. To determine whether you can trust results, you may want to check the data that was used to achieve the results, and review how it was processed.
- When something goes wrong at any stage, from ingestion through processing, you need to quickly pinpoint the data causing the problem. You must also identify the source of that data so you can contact its owner and make sure the problem doesn’t recur in future data sets.
- In addition to cleaning data and preventing errors, you may have other reasons related to quality control to preserve the lineage or provenance of data.
- Access has to be restricted to sensitive data. If users deliberately or inadvertently try to start a job on data they’re not supposed to see, your system should reject the job.
- Regulations may mandate the access restrictions mentioned in the previous bullet, and may impose other requirements that depend on the data source.
- Licenses may require access restrictions and other special treatment of some data sources.
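The access-restriction point in the list above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular platform enforces access: the names DatasetPolicy, AccessDenied, and submit_job are hypothetical, and the sketch assumes a simple role-based model in which the catalog records who may read each data set.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetPolicy:
    """Hypothetical access metadata stored alongside a data set."""
    name: str
    sensitive: bool = False
    allowed_roles: set = field(default_factory=set)


class AccessDenied(Exception):
    """Raised when a job targets data the user is not supposed to see."""


def submit_job(user_roles, dataset):
    """Consult the metadata before a job starts and reject it if necessary."""
    if dataset.sensitive and not (set(user_roles) & dataset.allowed_roles):
        raise AccessDenied(f"job rejected: no role grants access to {dataset.name}")
    return f"job accepted for {dataset.name}"
```

The check runs before the job starts, so a deliberate attempt and an inadvertent one are handled the same way.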
Ben Sharma, CEO and co-founder of Zaloni, talks about creating “a single source of truth” from the diverse data sets you take in. By creating a data catalog, you can store this metadata for use by downstream programs.
Zaloni divides metadata roughly into three types:
- Business metadata: This can include the business names and descriptions that you assign to data fields to make them easier to find and understand. For instance, the technical staff may have a good reason to assign the name loc_outlet to a field that represents a retail store, but you will want users to be able to find it through common English words. This kind of metadata also covers business rules, such as putting an upper limit (perhaps even a lower limit) on salaries, or determining which data must be removed from some jobs for security and privacy.
- Operational metadata: This is generated automatically by the processes described in this report, and includes such things as the source and target locations of data, file size, number of records, how many records were rejected during data preparation or a job run, and the success or failure of that run itself.
- Technical metadata: This includes the data’s type and format (text, images, JSON, Avro, etc.) and the structure or schema. This structure includes the names of fields, their data types, their lengths, whether they can be empty, and so on. Structure is commonly provided by a relational database or the headings in a spreadsheet, but may also be added during ingestion and data preparation. Zaloni’s data lake management platform integrates with Apache HCatalog for technical metadata so that other tools in the Hadoop ecosystem can take advantage of the structure definition.
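To make the three types concrete, here is a rough sketch of what a single catalog entry might hold. The class and field names (CatalogEntry, FieldSchema, BusinessInfo, OperationalInfo, find_field) are invented for illustration and do not reflect Zaloni’s or HCatalog’s actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSchema:
    """Technical metadata for one field."""
    name: str
    dtype: str
    length: Optional[int] = None
    nullable: bool = True


@dataclass
class BusinessInfo:
    """Business metadata: plain-English names and business rules."""
    business_names: dict  # technical field name -> business name
    rules: list


@dataclass
class OperationalInfo:
    """Operational metadata, normally filled in automatically."""
    source: str
    target: str
    file_size: int
    records: int
    rejected: int
    succeeded: bool


@dataclass
class CatalogEntry:
    data_format: str
    schema: list
    business: BusinessInfo
    operational: Optional[OperationalInfo] = None

    def find_field(self, term):
        """Return technical field names whose business name contains the term."""
        term = term.lower()
        return [tech for tech, label in self.business.business_names.items()
                if term in label.lower()]
```

With an entry that maps loc_outlet to the business name "Retail store", a call such as find_field("store") lets a user reach the field through common English words.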
As suggested in the previous list, one can also categorize metadata by the way it is gathered:
- Some metadata is embedded in the data, such as the schema in a relational database.
- Some metadata pertains to the data acquisition process: the source of the data, filename, time of creation, time of acquisition, file size, redundancy checks generated to make sure the transmission was not corrupted, and MD5 hashes generated to uniquely identify a file.
- Some metadata is created during ingestion. For instance, a watermark can be added to a file or to a column within the file. If you take JSON or other relatively unstructured data and create a schema around it, that schema becomes part of the metadata.
- Some metadata is created during a job run, such as the number of records successfully processed, the number of bad fields or bad records, and how long a job took.
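Acquisition metadata of the kind listed above is straightforward to collect with the Python standard library. The following is a sketch under stated assumptions: the function name acquisition_metadata and the dictionary keys are hypothetical, and the file’s modification time stands in for its time of creation, which not every file system records.

```python
import hashlib
import os
from datetime import datetime, timezone


def acquisition_metadata(path, source):
    """Collect basic acquisition metadata for a newly received file."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    stat = os.stat(path)
    return {
        "source": source,
        "filename": os.path.basename(path),
        "file_size": stat.st_size,
        # Modification time, used here as a stand-in for creation time.
        "modified_at": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "md5": md5.hexdigest(),
    }
```

Comparing the stored hash with one recomputed later is one way to confirm that a downstream stage is working on the same file that was originally acquired.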
The next question is how to create metadata. Many tools can extract the easy stuff, such as file sizes and timestamps, as the stages of processing proceed. Other metadata requires custom-written programs that do such things as tag particular data fields you’ll want to extract later.
At any stage of processing, you may choose to update the metadata. Each stage can also consult the metadata when applying rules for user access, cleaning, and submitting data to jobs. We’ll see later how, at least in theory, storing feedback in metadata can create an environment of continuous quality improvement.
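As a small illustration of a stage that both consults and updates the metadata, the sketch below reads a business rule (an upper limit on salaries) from a metadata dictionary, applies it while cleaning, and then writes the run statistics back. The dictionary layout and the names run_cleaning_stage and reject_ratio are assumptions made for this example only.

```python
import time


def run_cleaning_stage(records, metadata):
    """Apply a business rule taken from the metadata, then record the outcome."""
    limit = metadata["business_rules"]["salary_max"]  # consult the metadata
    start = time.perf_counter()
    good = [r for r in records if r["salary"] <= limit]
    metadata["job_run"] = {                           # update the metadata
        "processed": len(good),
        "rejected": len(records) - len(good),
        "seconds": time.perf_counter() - start,
        "succeeded": True,
    }
    return good


def reject_ratio(metadata):
    """Let a later stage judge the run from the stored feedback."""
    run = metadata["job_run"]
    total = run["processed"] + run["rejected"]
    return run["rejected"] / total if total else 0.0
```

A later stage, or a person reviewing results, could compare the value returned by reject_ratio against a threshold to decide whether the output deserves trust.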
Currently, one of the biggest challenges in data management is communicating metadata to downstream parts of a workflow. Many of the Zaloni Data Platform’s benefits rest on its ability to do this conveniently.