Bedrock Deep Dive: Hadoop Workflow Management for the Enterprise

March 6, 2015

Bedrock provides the foundation for developing Hadoop-based applications. It eliminates, or at least minimizes, the time spent on the infrastructure projects required to build a production-level Hadoop deployment.

One aspect of this is workflow management. At first glance, this may not seem very complex, considering you have tools like Oozie for creating workflows. However, once you start considering real-world scenarios that require integrating with, or triggering, external systems, things can get messy fast. Even basic functionality like moving files, emailing someone, or encoding files requires special scripting. Further, some of these tools require you to modify an XML file, which can become challenging to maintain once you start scaling up.
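
To see why, consider what even a simple one-step Oozie workflow looks like when maintained by hand. The fragment below is a minimal illustrative sketch; the action, class, and parameter names are placeholders, not taken from any particular deployment:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="prepare-sales-data">
  <start to="prepare-data"/>
  <action name="prepare-data">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.PrepareMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.PrepareReducer</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Multiply this by dozens of workflows, each with file moves, notifications, and error handling spliced in as custom scripts, and the maintenance burden becomes clear.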

As you add more workflows, you realize that although there are tools to help you monitor jobs and see how they are running on the cluster, there are few tools that monitor workflows at the application level. If a workflow fails, it can be quite a task to determine where the failure happened, how much of the work was already done, and what it takes to restart the workflow from the point of failure. How would you even know that a workflow had failed?

What does the Bedrock solution encompass?

Bedrock’s solution to workflow management encompasses three broad areas: Design, Execution, and Monitoring.

Design

The assumption here is that you have already created MapReduce programs or Hive queries specific to your scenario. Bedrock has a drag-and-drop user interface with a canvas where you can drop your actions and connect them. It provides control actions to run actions in parallel or to manage the execution flow through decisions. Some utility actions are available out of the box to send an email, watch for files, or move and copy files.

There are also pre-built MapReduce actions that perform watermarking for data lineage, masking and tokenization to hide sensitive fields, or UTF-8 conversion. One feature that is especially exciting to our customers is Data Quality. Bedrock evaluates incoming data against a set of rules that you provide, either at the field (or column) level or at the entity (or schema) level, so you know whether the data meets the standards you have set.
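
To make the distinction between the two rule levels concrete, here is a minimal Python sketch of field-level and entity-level checks. The rule names and structure are purely illustrative assumptions, not Bedrock's actual rule syntax or API:

```python
import re

# Illustrative field-level rules: one validation check per column.
FIELD_RULES = {
    "customer_id": lambda v: v.isdigit(),
    "email":       lambda v: re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v) is not None,
    "amount":      lambda v: float(v) >= 0,
}

def entity_rule(record):
    """Illustrative entity-level rule: the record must match the expected schema."""
    return set(record) == set(FIELD_RULES)

def evaluate(record):
    """Return a list of rule violations for one record (empty means clean)."""
    violations = []
    if not entity_rule(record):
        violations.append("schema mismatch")
    for column, check in FIELD_RULES.items():
        value = record.get(column, "")
        try:
            if not check(value):
                violations.append("{0} failed validation: {1!r}".format(column, value))
        except ValueError:
            violations.append("{0} is not parseable: {1!r}".format(column, value))
    return violations

print(evaluate({"customer_id": "123", "email": "a@b.com", "amount": "-5"}))
# -> ["amount failed validation: '-5'"]
```

However the rules are actually captured in the product, the evaluation concept is the same: flag records that fail field-level checks or break the expected schema.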

If you want to call an external system as part of your workflow (invoking an ODI transformation, for example), you can use the Shell action to trigger it. If workflows are set up to use parameters, you can reuse them across multiple files, which minimizes the maintenance effort when workflows need to change.
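
As a rough sketch of what parameterized reuse means in practice, the snippet below substitutes per-file parameters into a single shell command template. The script and parameter names are hypothetical, chosen only for illustration; they are not Bedrock's configuration:

```python
import subprocess

def run_shell_action(command_template, params):
    """Fill workflow parameters into a shell command template and run it."""
    command = command_template.format(**params)
    result = subprocess.run(command, shell=True, check=True,
                            capture_output=True, text=True)
    return result.stdout

# One template serves every incoming file; only the parameters change per run.
TEMPLATE = "sh run_odi_transform.sh --input {input_path} --date {run_date}"
run_shell_action(TEMPLATE, {"input_path": "/data/landing/sales_20150306.csv",
                            "run_date": "2015-03-06"})
```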

Execution

To provide flexibility in running workflows, Bedrock can trigger them in one of three ways:

  1. As a post-ingestion workflow, triggered as soon as a specified file is ingested. This is the most commonly used path, since you prepare the file as soon as it arrives rather than waiting until it needs to be used (see the sketch after this list).
  2. As a scheduled workflow, which can be one-time or recurring. Typically used for performing aggregations at specific intervals, such as hourly or daily.
  3. On demand, typically used during the development cycle. Once the workflow is solidified, it is triggered as either post-ingestion or scheduled.
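
To illustrate the post-ingestion pattern in option 1, here is a minimal sketch of a directory watcher that fires a workflow once for each newly landed file. The landing path, file pattern, and trigger_workflow callable are hypothetical stand-ins; Bedrock handles this internally as part of ingestion, so the polling loop below only illustrates the concept:

```python
import glob
import time

def watch_and_trigger(landing_pattern, trigger_workflow, poll_seconds=30):
    """Poll a landing area and trigger the workflow once per new file."""
    seen = set()
    while True:
        for path in glob.glob(landing_pattern):
            if path not in seen:
                seen.add(path)
                # Hand the new file to the workflow as a parameter.
                trigger_workflow(input_path=path)
        time.sleep(poll_seconds)

# Example (commented out because it polls forever):
# watch_and_trigger("/data/landing/sales_*.csv",
#                   lambda input_path: print("triggering workflow for", input_path))
```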

If you have capacity queues configured, then Bedrock will ask you to specify which queue you want to submit the workflow actions to. This is especially important if, for example, you have separate queues defined for your production and development jobs.
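
For reference, when a MapReduce job is submitted to a specific capacity queue, the selection comes down to the standard Hadoop property mapreduce.job.queuename. The sketch below shows an equivalent command-line submission; the jar, class, queue, and path names are placeholders, and the main class is assumed to implement Hadoop's Tool interface so the -D option is honored:

```python
import subprocess

# Placeholder jar, class, queue, and paths; -D sets the YARN capacity queue.
subprocess.run(
    ["hadoop", "jar", "my-app.jar", "com.example.MyJob",
     "-D", "mapreduce.job.queuename=production",
     "/data/input", "/data/output"],
    check=True,
)
```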

Monitoring

From an operations standpoint, Bedrock provides views to see which workflows are running, queued, or completed. For the ones that are running, you can drill down to see which step, or action, is currently being executed. For the completed workflows, you can see which ones failed and which were successful. You can see the start and end times at both the workflow and the step level. This helps in analyzing which workflows are taking the longest.

In case of any workflow failure, Bedrock will help you determine which step failed, and restart the workflow from the point of failure. You can also view and download the execution logs. All this is achieved through a point-and-click interface.

Lastly, you can use Bedrock to create distribution lists. Then, while creating workflows, you can specify which distribution lists should be notified of specific workflow events, such as workflow submission, failure, and completion.

Although all of these features seem intuitive and necessary for building a production-level Hadoop data lake, building this basic plumbing may not be the most effective use of your internal resources. Let Bedrock handle the framework-level capabilities while you focus on your business logic and get to your results faster.

Contact us to learn more about Bedrock and how we can help you.

 
