Why is managed data ingestion important?
Picture this: you have started the pilot on your Big Data project. You got some data in, added metadata about it, and ran some transformations. You had a great team that scripted the ingestion from each data source and stitched together the workflows to transform and prepare the data.
Great! Now comes the time to add new pipelines for other groups. While this is good news, it comes with challenges: they want to know how much data is going into Hadoop. They want to map the ingestion to their own definition of metadata. They want a historical view of incoming data. They want you to report on the quality of the data based on rules that they provide. And by the way, the data is not in files, but rather in a relational database somewhere. Can your team script all this? Even if you find a way to deliver, maintaining it is a huge task in itself, as requirements change and even newer sources are added.
Mapping ingested data to metadata in Bedrock
Bedrock handles file ingestions through the concept of a Landing Zone. Think of this as the point where data files from various sources get dropped before being moved to the Hadoop Distributed File System (HDFS). In our experience, this is a fairly typical way of handling file ingestions, because it controls the flow of data going in and out of HDFS. It has the added benefit of defining network connectivity only once, rather than having to configure multiple clients to access Hadoop.
You provide the file pattern, where to pick up files from, and where they need to go on HDFS. Bedrock will monitor the source directory for that file pattern and handle the ingestion. If you have defined metadata for this file, such as business name, structure and so on, it will also associate any incoming files with that metadata.
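To make the idea concrete, here is a minimal Python sketch of a landing-zone rule: a file pattern, a source directory, a destination, and the metadata to attach to each matching file. This is not Bedrock's API. The names LandingZoneRule and ingest_once are hypothetical, and a local directory stands in for the HDFS destination.

```python
import fnmatch
import shutil
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class LandingZoneRule:
    """Hypothetical ingestion rule; not Bedrock's actual configuration model."""
    pattern: str        # file pattern to watch for, e.g. "sales_*.csv"
    source_dir: Path    # where files are dropped
    target_dir: Path    # local stand-in for the HDFS destination
    metadata: dict = field(default_factory=dict)  # e.g. business name, structure


def ingest_once(rule):
    """Move every file matching the rule once; return (destination, metadata) pairs."""
    rule.target_dir.mkdir(parents=True, exist_ok=True)
    ingested = []
    for path in sorted(rule.source_dir.iterdir()):
        if path.is_file() and fnmatch.fnmatch(path.name, rule.pattern):
            dest = rule.target_dir / path.name
            shutil.move(str(path), str(dest))
            # Associate the incoming file with the metadata defined for the rule.
            ingested.append((dest, dict(rule.metadata)))
    return ingested
```

A real monitor would call something like ingest_once on a schedule or react to file-system events. The sketch only shows one pass, and files that do not match the pattern are left where they are.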
What if you already have an existing system in place that puts files directly into Hadoop (think Sqoop and Flume), but you still want to associate it with the metadata and run some transformations? Bedrock’s HDFS-based Landing Zone will handle that for you. What if you have relational data that you want to ingest? The Zaloni team has experience with that, too.
As the data comes in, Bedrock tracks it and provides an Inventory Dashboard that lets you explore the data through different facets, such as size, ingestion date, entity type, and source. Every ingested file is watermarked so that it remains traceable as the data gets transformed. Any ingestion failures are reported through Bedrock's notification system, and the affected files are moved to a configurable holding area.
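The tracking described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Bedrock implements it: the watermark is modeled as a SHA-256 content hash, an empty file is treated as a failed ingestion, a printed message stands in for the notification system, and record_ingestion is a hypothetical helper name.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def record_ingestion(path, source, entity_type, inventory, holding_dir):
    """Append an inventory record for path, or move the file to holding_dir on failure."""
    try:
        data = path.read_bytes()
        if not data:
            # Assumed failure rule for this sketch: empty files are rejected.
            raise ValueError("empty file")
        record = {
            "file": path.name,
            "size_bytes": len(data),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "source": source,
            "entity_type": entity_type,
            # Assumed watermark: a content hash that can be carried through transformations.
            "watermark": hashlib.sha256(data).hexdigest(),
        }
        inventory.append(record)
        return record
    except (OSError, ValueError) as exc:
        # Stand-in for a notification, then move the file to the holding area.
        holding_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(holding_dir / path.name))
        print(f"ingestion failed for {path.name}: {exc}")
        return None
```

The fields in each record mirror the dashboard facets mentioned above (size, ingestion date, entity type, source), so a list of such records could be grouped or filtered along any of them.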
Now that Bedrock handles both file and relational ingestions, streaming ingestion is the next area that we plan to target. This feature is slated for release in the first half of 2015.