This is the first in a multi-part series of blogs discussing Hadoop distribution differences to help enterprises focus on the important factors involved in choosing a Hadoop distribution—without hype or marketing spin. It was written in collaboration with Gregory Wood. Zaloni has a long history of helping companies gain tangible business value from Hadoop, regardless of the distribution.
Zaloni engineers its products to work with nearly any Hadoop distribution; however, we are often asked what distribution is “best.” The choice of a Hadoop distribution can sometimes be highly debated. There are many aspects to making this choice, such as:
- The versions of the components within a distribution, which affect, among many other things, connectivity to both upstream and downstream technologies
- Legacy EDW technology, which demands a certain version of Apache Hive based on the APIs used to connect Hadoop to other data sources and tooling
- What versions of what packages are included in which distributions
It is often easy to say from a 30,000-foot view that Hadoop version X.Y.Z is needed—and call it a day. On the other hand, many Hadoop consumers have more than one distribution and are challenged to understand a complex, intercompany support matrix of versions and their related features. To make matters worse, not all Hadoop vendors package the same way or use the same type or number of patches.
Here are some basic insights to begin to help you select what may work best for you:
- Most Hadoop distributions repackage some core set of packages, including Apache Hadoop itself, along with technologies like Apache Hive and Apache HBase with variation in choices along the edges of each.
- Companies like Cloudera and Hortonworks have bound most of their major version numbers tightly to a series of Apache projects and their value-added packages to make a complete distribution.
- Notable exceptions include Apache Ambari, which can be used as a management interface to multiple versions of the Hortonworks Data Platform.
- MapR Technologies is an outlier in this discussion in that it has a more complex relationship between distribution version and package support. Not only can you, in many cases, continue to run an old version of Hive when a major version upgrade of MapR comes out, you may also run another version of Hive alongside it. This is a wonderful option for customer flexibility. It is especially helpful with production workloads, where a change in versions might force a code or behavior change that could disrupt critical business functionality. It also eases the transition to newer versions of packages by allowing testing to occur on the newer version with limited disruption of production work.
- Most distributions now release “previews” or “tech releases” ahead of major version upgrades, so this can be another testing avenue for distributions that don’t have MapR-style package flexibility.
So how do you tell what specifically is supported? It requires digging into documentation, release notes and, in some cases, taking a look at the online package repositories themselves. We have done the digging around for you for the current “state of the state” to create the following chart.
Please see the links at the end for resources used to compile this chart.
The challenge with keeping track of it all is the speed at which the information in the above matrix will change. Apache projects, along with open source in general, are able to innovate at a pace that eclipses traditional software development. This not only causes confusion for consumers, but also very complex support and testing issues for software vendors who play in the Hadoop space.
Forces such as pricing, legacy technology support and company technology mandates will influence your choice—but ultimately all distributions have strengths and weaknesses. Our strong recommendation is to choose the distribution that works best for the use case at hand.
Decoding the package version
Different distributions share their package information mainly via the documentation. Most times, finding a major and minor version number in the docs is enough. Scratching just below the surface reveals a good bit of difference and some unexpected results.
Example 1: Cloudera
Let's start with Cloudera. To their credit, the documentation provided is very forthcoming. For example, when examining their release notes for a specific package, e.g., Hadoop, it provides a detailed version number, as you can see below.
In this case, Hadoop is the package name and 2.6.0 is the version of the Hadoop package that works with CDH version 5.8.0, plus 1592 patches. All of these patches are documented and easily accessible in the package-specific release notes available in the chart. Looking at that detail, we see bug fixes and improvements to YARN, HDFS and Hadoop itself that the product managers at Cloudera decided were important to their distribution. It should be noted that this is the entire reason for using a distribution of Hadoop. You can always use Apache directly. However, companies like Cloudera have drawn on their experience to smooth out the challenges you would face using the Apache source directly, giving you an integrated, more user-friendly experience.
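As a rough illustration of that decoding, here is a minimal Python sketch. The exact string layout (`2.6.0+cdh5.8.0+1592`) is an assumption assembled from the three pieces named above (upstream version, CDH release, patch count), and `parse_cdh_version` is a hypothetical helper, not part of any Cloudera tooling.

```python
import re
from typing import NamedTuple


class CdhVersion(NamedTuple):
    upstream: str      # Apache release the package is based on
    cdh_release: str   # CDH release the package ships with
    patch_count: int   # number of patches applied on top of upstream


# Assumed layout: <upstream version>+cdh<CDH release>+<patch count>
_CDH_PATTERN = re.compile(r"^(\d+(?:\.\d+)*)\+cdh(\d+(?:\.\d+)*)\+(\d+)$")


def parse_cdh_version(version: str) -> CdhVersion:
    """Split a CDH-style version string into its three parts."""
    match = _CDH_PATTERN.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    upstream, cdh_release, patches = match.groups()
    return CdhVersion(upstream, cdh_release, int(patches))


print(parse_cdh_version("2.6.0+cdh5.8.0+1592"))
```

Keeping the patch count as an integer makes it easy to compare how far two releases have drifted from the same upstream version.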
Example 2: Hortonworks
Looking next at Hortonworks, we see only major and minor versions on their main documentation page. In the navigation pane to the left, there is a section called Apache Patch Information that details, by package, what changes were made. Counting the changes is left up to you. To parallel our Hadoop example from Cloudera, there are some 130-odd changes listed for Hadoop alone. This includes changes from HDP 2.4.0 onward, as the release notes tell us the packages did not change version from HDP 2.4.0 to 2.4.2. As with Cloudera, the changes touch Hadoop, HDFS and YARN. Finding the specific package version requires looking closely at the Hortonworks package repository. Looking at the packages for HDP 2.4.2, we find as an example: hadoop_2_4_2_0_258-yarn-2.7.1.2.4.2.0-258.el6.x86_64.rpm
This can be decoded as the project name (Hadoop) with HDP version 2.4.2, then the package name (YARN) at version 2.7.1 for HDP version 2.4.2.0, a dash, build number 258, and finally the platform: Enterprise Linux 6 on 64-bit x86 processors. In this case, the package name itself does not convey the patch version. Naming packages this way is a typical best practice in development operations: a package name and some versioning convention, followed by a build number and, finally, platform information.
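The decoding described above can be sketched in a few lines of Python. This is an illustration only: the file name is written in the form implied by the breakdown above (YARN 2.7.1, HDP 2.4.2.0, build 258, el6, x86_64), the regular expression is an assumption that fits this one example, and `parse_hdp_rpm` is a hypothetical helper rather than anything Hortonworks ships.

```python
import re
from typing import NamedTuple


class HdpPackage(NamedTuple):
    project: str            # e.g. "hadoop"
    hdp_version: str        # e.g. "2.4.2.0"
    build: str              # e.g. "258"
    component: str          # e.g. "yarn"
    component_version: str  # e.g. "2.7.1"
    os_release: str         # e.g. "el6"
    arch: str               # e.g. "x86_64"


# Assumed layout:
# <project>_<HDP version with underscores>_<build>-<component>
#   -<component version>.<HDP version>-<build>.<os>.<arch>.rpm
_HDP_RPM = re.compile(
    r"^(?P<project>[a-z]+)_(?P<hdp>\d+(?:_\d+){3})_(?P<build>\d+)"
    r"-(?P<component>[a-z]+)"
    r"-(?P<version>\d+(?:\.\d+)*)-(?P<release>\d+)"
    r"\.(?P<os>[a-z0-9]+)\.(?P<arch>[a-z0-9_]+)\.rpm$"
)


def parse_hdp_rpm(filename: str) -> HdpPackage:
    """Break an HDP-style RPM file name into its labeled parts."""
    match = _HDP_RPM.match(filename)
    if match is None:
        raise ValueError(f"unrecognized package name: {filename!r}")
    hdp_version = match["hdp"].replace("_", ".")
    suffix = "." + hdp_version
    version = match["version"]
    # The HDP version and build number appear twice; check that they agree.
    if not version.endswith(suffix) or match["release"] != match["build"]:
        raise ValueError(f"inconsistent HDP version or build in {filename!r}")
    return HdpPackage(
        project=match["project"],
        hdp_version=hdp_version,
        build=match["build"],
        component=match["component"],
        component_version=version[: -len(suffix)],
        os_release=match["os"],
        arch=match["arch"],
    )


print(parse_hdp_rpm("hadoop_2_4_2_0_258-yarn-2.7.1.2.4.2.0-258.el6.x86_64.rpm"))
```

The component version is recovered by stripping the HDP version from the end of the combined version field, so the sketch does not have to assume how many digits the upstream version has.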
Example 3: MapR
MapR is a bit of a different beast, worthy of an entire blog post. That aside, it should be noted that MapR maintains API compatibility with core Hadoop. This is intentional, as the MapR distribution was designed differently from the ground up. In the MapR documentation, the parallel to the package called “Hadoop” in the other distributions is called “Apache MapReduce API.” As mentioned previously, MapR also supports a range of package versions for each release of its distribution, and it logically organizes its product into MapR core and ecosystem packages like Hive. Examining the MapR documentation requires you to look at the “ecosystem matrix” within a specific major version of the distribution, which in this case is 5.1. We can see, for example, that in the matrix for version 5.1, the Hadoop 2.7.1 API is supported for MapR 5.0 and 5.1.
In this case we don't have ANY listing for patches to the Hadoop package in the release notes. Why? Because MapR maintains API compatibility but does not repackage Hadoop directly. For other packages like Hive, there are package-specific pages detailing both the patches and their related packages. One especially nice feature is the handy links to GitHub and Maven.
In MapR, this is a parallel to the Hadoop package in the other distributions. This can be decoded as MapReduce (aka Hadoop 2) version 2.7.0 with build version ID (37549), followed by release state (GA) and platform. As we can see, there is a slight inconsistency between the ecosystem matrix in the doc and what is listed in the repo. This is a great example of why it pays to be a student of exactly what you are using when you take on a Hadoop distribution in your architecture.
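A check like this is easy to automate once you have both version strings in hand. The sketch below is a minimal Python illustration using only the two versions mentioned above (2.7.1 from the ecosystem matrix, 2.7.0 from the repository); the function names are hypothetical, and it assumes purely numeric, dot-separated versions.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.7.1' into (2, 7, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def compare_doc_to_repo(documented: str, packaged: str) -> str:
    """Report whether the packaged version matches the documented one."""
    doc, pkg = version_tuple(documented), version_tuple(packaged)
    if doc == pkg:
        return "match"
    return "repo is older than docs" if pkg < doc else "repo is newer than docs"


# Versions taken from the text: ecosystem matrix vs. package repository.
print(compare_doc_to_repo("2.7.1", "2.7.0"))
```

Comparing tuples of integers rather than raw strings avoids the classic mistake where "2.10.0" sorts before "2.9.0".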
As you can see, there are many detailed sources of information available publicly for all the main distributions available today. Many times it is fine to just understand the major and minor versioning of a distribution. As organizations grow in maturity with Hadoop, they tend to take on additional use cases and integration work. Then it is often important to understand in detail when a specific feature or a specific package has been included. We hope this glimpse into Hadoop versions across distributions helps you determine what distribution is right for your organization. If you have questions, please don’t hesitate to contact us.
About the Author
Adam Diaz, big data & Hadoop thought leader