Schema and metadata management in Hadoop differs from management in a traditional data warehouse. Because there’s more flexibility when loading data into Hadoop (i.e., you don’t have to capture the schema on write), it’s extremely important to consider your strategy up front.
Let’s take a look at why. Without schema, using Hadoop is like trying to shop for food at a grocery store that doesn’t have any labels to tell you where or what anything is. At this “schema-on-read” grocery store, you go where you think the pasta is and grab a box. Is it spaghetti? You don't know because you won’t get the "schema" until you get home and open the box. When you do get home and unpack your groceries, you may find they’re not what you need. You didn't have schema (or metadata) about the ingredients, so the results of your cooking (analysis) are going to be questionable. In a regular grocery store, the "schema" is readily available so you can be sure you’re getting exactly the right ingredients and you know where they came from, how much there is, and more.
Methods for adding schema in Hadoop
There are no two ways about it: you must have schema to read and use data. So the real question is not whether to add schema, but when it makes the most sense for your Hadoop project. With Hadoop, you have the option of capturing the schema when you use the data (“on read”) or when you load data into Hadoop (“on write”). Each method has its benefits and drawbacks.
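To make the trade-off concrete, here is a minimal Python sketch. The toy CSV data, the schema, and the function names are illustrative only, not from any real pipeline: schema on write validates each record against a declared schema at load time and rejects what doesn’t fit, while schema on read stores raw records and defers interpretation until query time.

```python
import csv
import io

# Raw data as it might arrive in the Data Lake -- just text, no schema attached.
RAW = "1001,widget,19.99\n1002,gadget,oops\n"

# A hypothetical schema: column names paired with Python types.
SCHEMA = [("id", int), ("name", str), ("price", float)]

def load_schema_on_write(raw):
    """Schema on write: validate every record at load time.
    Records that don't fit the schema are rejected before they land."""
    good, rejected = [], []
    for row in csv.reader(io.StringIO(raw)):
        try:
            good.append({name: cast(val) for (name, cast), val in zip(SCHEMA, row)})
        except ValueError:
            rejected.append(row)
    return good, rejected

def load_schema_on_read(raw):
    """Schema on read: capture everything as-is; interpretation (and the
    discovery of the bad record) is deferred to whoever queries the data."""
    return list(csv.reader(io.StringIO(raw)))

good, rejected = load_schema_on_write(RAW)
everything = load_schema_on_read(RAW)
print(len(good), len(rejected), len(everything))  # 1 good, 1 rejected, 2 stored
```

Note how the “oops” price is caught at write time in the first approach, but silently stored in the second: exactly the trade-off between rejecting awkward data early and capturing everything for later.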
What’s exciting about Hadoop is that you can load large volumes of any type of data, and do it relatively quickly. You capture everything. In a relational database, data can be rejected if it doesn’t fit the schema. This slows things down, and data that isn’t considered important at the time is often lost and irretrievable later.
Determining your schema ahead of time
Yes, there are some valid reasons for doing schema on read. For example, if you really don’t know the schema for the data you need to load, reverse-engineering the schema on read is a powerful tool. But if you do know the schema up front – and many projects do, as the data often comes from relational databases or arrives in log or streaming message formats – it’s smart to capture it as soon as you can: on write, or soon after. In our experience, the odds of users going back to add it later are low. Picture it: without schema, the Data Lake becomes littered with individual files that have no association with corresponding entities or structures. As a result, the data is difficult for analysts, business users, or data scientists to use. It really is worth taking the time up front to determine the schema for what you’re capturing and then associate that schema with the data.
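One lightweight way to associate schema with data at write time is a sidecar metadata file that lands next to the data itself. The layout below – a `_schema.json` file beside the data files, written by a `land_with_schema` helper – is a hypothetical convention for illustration, not a Hadoop or vendor standard:

```python
import json
import tempfile
from pathlib import Path

def land_with_schema(base_dir, dataset, records, schema):
    """Write a dataset and its schema together, at load time -- not later.
    Layout: <base_dir>/<dataset>/part-0000.jsonl plus _schema.json sidecar."""
    ds_dir = Path(base_dir) / dataset
    ds_dir.mkdir(parents=True, exist_ok=True)
    # The data itself, as newline-delimited JSON for simplicity.
    with open(ds_dir / "part-0000.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    # The schema lands right next to it, so files never lose their structure.
    with open(ds_dir / "_schema.json", "w") as f:
        json.dump(schema, f, indent=2)
    return ds_dir

schema = {"fields": [{"name": "id", "type": "int"},
                     {"name": "event", "type": "string"}]}
with tempfile.TemporaryDirectory() as tmp:
    out = land_with_schema(tmp, "clickstream", [{"id": 1, "event": "view"}], schema)
    print(sorted(p.name for p in out.iterdir()))  # ['_schema.json', 'part-0000.jsonl']
```

Because the schema travels with the files, a later reader never has to guess what the columns mean – the opposite of the littered Data Lake described above.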
Why “on write” is almost always right
Although you can add schema on read, doing so slows down data analysis: the user first has to spend significant time understanding what the data actually is before doing any analysis.
Here are two typical scenarios for which schema on write makes more sense:
- Data users want to analyze data through interactive applications, such as dashboards. Schema on read is impractical for these applications.
- You have multiple data users. With schema on read, each user needs to add schema themselves, versus adding schema one time (on write) and having it available in a central place for all users to access.
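The second scenario can be sketched as a central registry that is written once and read by everyone. The `SchemaRegistry` class and its methods below are purely illustrative – a stand-in for a shared catalog, assuming one pipeline registers the schema and many users look it up:

```python
class SchemaRegistry:
    """A minimal in-process sketch of a central schema store: schema is
    captured once, at write time, and shared by every consumer."""

    def __init__(self):
        self._schemas = {}

    def register(self, dataset, schema):
        """Called once, by the pipeline that writes the data."""
        self._schemas[dataset] = schema

    def lookup(self, dataset):
        """Called by any number of downstream users and tools."""
        if dataset not in self._schemas:
            raise KeyError(f"no schema registered for {dataset!r}")
        return self._schemas[dataset]

registry = SchemaRegistry()
registry.register("orders", {"fields": ["order_id", "sku", "qty"]})

# Two different users read the same central definition instead of each
# reverse-engineering the files themselves.
print(registry.lookup("orders")["fields"])  # ['order_id', 'sku', 'qty']
```

The design point is simply that `register` runs once per dataset while `lookup` runs per user, so the cost of adding schema is paid a single time rather than by every analyst.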
Tools to capture schema
That said, Hadoop offers no great built-in way to capture schema. You can use Apache HCatalog, but you’ll need to build an application around it to manage the catalog entries. Or you can use a third-party tool like Zaloni Bedrock, which has a simple-to-use web interface and REST APIs for integrating schema management with any application. And don’t worry – Bedrock is flexible. While we do advocate schema on write, Bedrock also supports schema on read to meet the specific needs of each of our clients.