Overcoming output limitations
A core tenet of the Hadoop cluster architecture is the replication of data blocks across nodes. The reasoning behind this is that it protects against the failure of an individual node by having backup data elsewhere. An issue with this is that slow write speeds to the HDFS can hurt the performance of a job with multiple steps, especially when the size of the output of a step is close to the size of the input. Even with frameworks such as Spark, which take advantage of in-memory computation for individual jobs, sharing of the output is usually limited by network bandwidth or disk throughput.
Improving lineage, checkpoints, and storage
Tachyon/Alluxio is an open source 'memory-centric distributed storage system', introduced by a group of researchers from UC Berkeley and MIT in a seminal paper in 2014. Tachyon/Alluxio creates a transparent file system that sits on top of and mirrors the native file system (HDFS, S3, etc) that is available to jobs to write to/read from. A couple concepts are at work here:
- Lineage: Tachyon/Alluxio provides an API to record the inputs/outputs of jobs to provide fault-tolerance. Using a directed acyclic graph (DAG) of all jobs, the optimal datasets are checkpointed to guarantee a bounded recomputation time.
- Asynchronous checkpointing: ‘hot’ files (files accessed a lot) and leaves of the DAG are checkpointed to the storage layer.
- Tiered storage system: Storage is filled in the order of memory -> SSD -> HDD.
An example workflow
Suppose I have a workflow with a couple steps: I run a query on some data in a Hive table using Spark SQL, take the output of that and run a Spark job on it, then use a Spark ML library to run some analysis. For parts 1 and 2, I may have outputs that are close to the size of my inputs. In a traditional HDFS-centric system, the performance of the overall workflow would be bounded by the write throughput. However, with Tachyon/Alluxio, we leave the output in-memory, able to be shared between jobs. Our workflow now looks like this:
- Run a query (S1) on data in a data warehouse: Output O1 is stored in Tachyon.
- Run a Spark transformation (S2) on O1: Output O2 is stored in Tachyon.
- At this point the lineage information for S1 is stored by Tachyon, so O1 might be removed from memory based on resource allocation.
- Run a Spark ML algorithm on O2: output is a model.
The above image shows a view of a Tachyon/Alluxio file system. ‘hello’ and ‘hello2’ are two files that are currently cached in-memory.
Evolving compatibility and deploymentExisting Spark and MapReduce programs are compatible with Tachyon/Alluxio without any code change. Tachyon also has multiple deployment options, including integration with Mesos and YARN. There are also a case study at Baidu that is available online for further reading.