Cluster Disposability in Hadoop

July 12, 2017 Adam Diaz

In Hadoop, disposability typically refers to compute nodes. Nodes are disposable because of the basic properties of Hadoop itself combined with a properly designed Hadoop architecture.

HDFS, for example, stores each block of a file on one node with, by default, two additional copies in other locations within the cluster. Under-replicated blocks are restored automatically by the normal operation of the file system. Likewise, the tasks running on those nodes are usually part of a larger job, so the loss of work on any single node is typically not catastrophic.
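As a minimal sketch of how that replication factor is exposed to applications (assuming a reachable HDFS instance and the Hadoop client libraries on the classpath; the file path below is a placeholder), the replication of a file can be inspected and adjusted through the standard FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            // Connects using the default configuration (core-site.xml / hdfs-site.xml).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // "/data/example.txt" is a placeholder path for illustration.
            Path file = new Path("/data/example.txt");
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Request three copies of every block; the NameNode schedules the
            // extra copies in the background -- the same mechanism that heals
            // under-replicated blocks after a node is lost.
            fs.setReplication(file, (short) 3);
            fs.close();
        }
    }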

The compute node is then considered "disposable": both the work running on a single node and the data stored on it could be removed from the cluster entirely without bringing operations (or even a single job) to a halt. In fact, provided that each node is not over-scoped (a compute node carrying an overwhelming amount of disk), the data stored in the cluster actually becomes safer as the node count scales.

Decreased Risk of Data Loss

The basic concept of linear scalability in Hadoop relies on a balanced compute node architecture that is scaled horizontally rather than vertically. Vertical scale concentrates the risk of data loss onto fewer machines and magnifies the impact on running processes if and when one of those machines goes offline (never mind the loss of basic parallelism). CPU, memory, and disk should be added in a balanced way to deliver the expected linear scale and keep each node disposable.
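As a back-of-the-envelope illustration (the cluster capacity and node counts are assumptions chosen for round numbers, not benchmarks), compare how much data must be re-replicated when a single node dies in a fat-node versus thin-node design:

    public class ScaleRisk {
        public static void main(String[] args) {
            long totalTb = 1000;          // hypothetical cluster capacity: 1 PB

            // Vertical: few fat nodes. Horizontal: many thin nodes.
            int fatNodes = 10, thinNodes = 200;

            // Fraction of cluster data that must be re-replicated when one node dies.
            System.out.printf("Fat node failure:  %.1f%% of data (%d TB) to recover%n",
                    100.0 / fatNodes, totalTb / fatNodes);
            System.out.printf("Thin node failure: %.1f%% of data (%d TB) to recover%n",
                    100.0 / thinNodes, totalTb / thinNodes);
        }
    }

In this example the fat-node failure forces twenty times more data to be re-replicated, which is exactly how vertical scale works against disposability.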

As the node count grows, whole racks become disposable, and beyond that, entire rows of racks. This works because HDFS replica placement is rack-aware: by default, at least one copy of each block lands on a different rack than the others. Past a certain node count, large numbers of nodes can be offline without bringing the system down or even slowing jobs that are already running.
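Rack awareness is something the operator supplies. One way (a sketch only; the hostname-to-rack naming scheme here is an assumption) is a custom mapping class registered via the net.topology.node.switch.mapping.impl property:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Maps each worker hostname to a rack path so HDFS can spread replicas
    // across racks. Hostnames like "node-r2-07" are assumed to encode the
    // rack ("r2") purely for illustration.
    public class SimpleRackMapping implements DNSToSwitchMapping {
        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<>();
            for (String name : names) {
                String[] parts = name.split("-");
                racks.add(parts.length > 1 ? "/" + parts[1] : "/default-rack");
            }
            return racks;
        }

        @Override
        public void reloadCachedMappings() { }

        @Override
        public void reloadCachedMappings(List<String> names) { }
    }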

Making Disposable Clusters a Reality

Taken to its logical end: what if all the compute nodes failed? Without a secondary site, the result would be a complete loss of function. Individual nodes have been understood to be "disposable" in Hadoop for a long time, but it was also accepted dogma that whole clusters were NOT disposable. Amazon Elastic MapReduce (EMR) definitely challenged that dogma, but the model presented by Google Cloud Dataproc truly brings cluster disposability to reality.

An average EMR bootstrap can take tens of minutes (roughly 30), while Dataproc can bootstrap a cluster in a fraction of that time (roughly 90 seconds). A recent post on the Google Cloud Platform blog challenged much of what I thought I knew about Hadoop.

A lot of noise has been made in the Hadoop world about the fight between physical and cloud deployments and about multitenancy. At first, it was YARN that would save the day: the ability to write custom YARN ApplicationMasters would save us all, but low rates of adoption led to "easy button" projects like Apache Slider. The newest generation of multitenancy efforts includes Docker, Mesos, and Myriad, with companies like Hortonworks branding it "Data Lake 3.0" in a new six-part series. But is it all necessary?

Cloud-based Hadoop

The next generation of Hadoop, being cloud-based, scales dynamically with no Hadoop tuning necessary. The total cost, in terms of the time needed to spin up a cluster, is extremely low. Why not use a single cluster for a single job and then throw the cluster away? Dedicate ALL the resources of the cluster to getting that one piece of work done as fast as possible.
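The whole create, submit, delete cycle is easy to script. Here is a minimal sketch of that pattern (assuming the gcloud CLI is installed and authenticated; the cluster name, worker count, and jar path are placeholders, and required flags such as a region may vary with your gcloud version):

    import java.util.Arrays;

    public class EphemeralCluster {
        // Runs a shell command, streaming its output, and fails fast on error.
        static void run(String... cmd) throws Exception {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new RuntimeException("Command failed: " + Arrays.toString(cmd));
            }
        }

        public static void main(String[] args) throws Exception {
            String cluster = "ephemeral-job-1";   // placeholder name

            // 1. Spin up a short-lived Dataproc cluster.
            run("gcloud", "dataproc", "clusters", "create", cluster,
                    "--num-workers=2");

            try {
                // 2. Give the single job the whole cluster.
                run("gcloud", "dataproc", "jobs", "submit", "hadoop",
                        "--cluster=" + cluster,
                        "--jar=gs://my-bucket/my-job.jar");   // placeholder jar
            } finally {
                // 3. Throw the cluster away, success or failure.
                run("gcloud", "dataproc", "clusters", "delete", cluster, "--quiet");
            }
        }
    }

Wrapping the delete in a finally block means the cluster is thrown away whether the job succeeds or fails, so nothing is left idling on the bill.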

Used this way, the cluster is also isolated to a single user, which addresses many of the security concerns present in multitenant clusters. It also lets one almost completely sidestep YARN as a requirement.

Concerns about how to optimize ApplicationMasters, job placement, and scheduling all become concerns of the past. The ease of scaling, the reduced complexity, and the lower bill make this GCP model just as disruptive to Hadoop as Hadoop was to the RDBMS.

About the Author

Adam Diaz

Big data and Hadoop thought leader
