Open Source: How Open Is It?

September 1, 2016 Adam Diaz

This is the second in a multi-part series of blogs discussing Hadoop distribution differences to help enterprises focus on the important factors involved in choosing a Hadoop distribution, without hype or marketing spin. Zaloni has a long history of helping companies gain tangible business value from Hadoop, regardless of the distribution.

Many in the Big Data game claim their products are “open source.” Some even go so far as to say they are “100% open source.” A closer look reveals that most offerings are really what is called “open core.” As you might guess, this means that the core of the offering is open source; however, around the core, the company has packed a variety of levels of proprietary software.

Many enterprises considering Hadoop have already found that directly using Apache-based packages requires technological wizardry that may fall outside their group's core business or technical competency. For some organizations with large development teams, a high level of complexity is not a showstopper. The reality is that not everyone wants to make a massive investment in a software engineering department in order to take on a new technology that is supposed to make things “easier.” While being open source does provide advantages to fast-moving organizations that are able to edit source code directly, it proves challenging for smaller organizations looking to take advantage of Hadoop’s cost savings without the same low-level technical staff. The flexibility for one group is a roadblock for another. Because Hadoop is a core piece of modern big data architectures, the choice of distribution is worthy of close examination prior to selection.

Defining “Open Source”

First, let’s get on the same page about what “open source” means. The Apache Software Foundation (ASF) definition of open source software includes not only posting source code so others can access it, but also a governance model: for example, determining how multiple parties can participate in a single project and how that project will be governed. This is a very different model from classic software development, as well as from the simple act of sharing source code.

The danger of not having a governance model and a context for managing change in code over time is instability and branching. Just because code is shared on GitHub does not guarantee that in the next revision a business-critical API won’t be completely removed or altered beyond usability. Long term, the development model is much like classic closed-source development, where consumers are at the whim of the product manager. There are no guarantees of long-term compatibility, nor any way for end users to influence product direction other than the heavy lifting of editing the source themselves to maintain their own branch of the software. This is fine for some groups -- but a project killer for others.

Hadoop distributions: No two the same!

Making things easier is essentially the motivation behind a Hadoop distribution. Someone has taken the time to take the pain out of making a collection of Hadoop and related technologies work together as a unit. This “prepackaged” unit is documented, version controlled and backed with commercial support. Sometimes this packaging requires patches -- a normal part of the engineering process -- which can lead to some variation in the packaging. In addition, some distribution companies will also include packages that they themselves have developed.

A closer examination of the packages included in any one distribution often leads to the realization that there are some very different definitions of “open source.”

The Big Three: Cloudera, Hortonworks and MapR

Cloudera. At a distribution level, Cloudera packages its core Hadoop offering, called CDH, with management and security offerings to form a complete enterprise product called Cloudera Enterprise. The core of the offering (CDH) contains open source packages based on the Apache Software Foundation definition of open source. Other packages like Cloudera Search are “based on” packages like Apache Solr. Value-added packages like Cloudera Navigator and Cloudera Manager, which really impart the ease of use to Cloudera’s distribution, are part of their secret sauce (aka, intellectual property). This is completely fine and, in fact, a great business model: packaging up a company’s expertise and best practices into offerings that make using Hadoop easier. Cloudera’s is arguably the easiest-to-use Hadoop distribution today. For example, feel free to examine the process of setting up user authentication with Kerberos across all three of the major Hadoop players. Done manually, this process can keep you up at night. You want a tool like Cloudera Manager to keep you focused on your core business via ease-of-use features. The notable exception to the Apache-based package story is Hue, the main user interface for Cloudera. It is called open source because its source code is shared online -- but as discussed above, being open source is more than sharing code.

Hortonworks. All of the components of Hortonworks’ distribution are in fact Apache projects. This includes the management framework, called Ambari; their offerings for security, data governance and lineage; and their newest streaming technology offerings. Hortonworks offers no alternate versions of software or withheld packages needed to make the distribution complete. One can go directly to the Hortonworks site and obtain the company’s full offering. Their entire distribution is developed in the context of Apache projects, and they are major contributors to several core Hadoop projects. They produce no closed-source software for their distribution. In the interest of full disclosure, even Hortonworks depends upon a number of third-party products, but this is not unique to them. For example, Hive requires a database for the metastore, which in many cases is MySQL.
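To illustrate that third-party dependency, here is a minimal sketch of a `hive-site.xml` fragment pointing the Hive metastore at an external MySQL database. The host name, database name, and credentials are placeholders, not values from any vendor's documentation:

```xml
<!-- hive-site.xml (sketch): back the Hive metastore with MySQL -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- hypothetical host/database; adjust for your environment -->
    <value>jdbc:mysql://metastore-db.example.com:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>changeme</value>
  </property>
</configuration>
```

Note that the MySQL JDBC driver jar itself is separately licensed and must be placed on Hive's classpath -- exactly the kind of non-Apache dependency described above.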

MapR. MapR might be considered the most proprietary offering of the big three. This is intentional, done with a mindset of maintaining API compatibility while providing a more customized set of offerings built on a unique distributed file system architecture. That file system has been shown to outperform HDFS and eventually gave rise to additional, alternate NoSQL (MapR DB) and streaming (MapR Streams) offerings available in the most recent version. Each of these offerings benefits uniquely from the MapR FS foundation. Long term, though, this is a story of API compatibility and not a guarantee of being 100% open source. Much like Cloudera, MapR’s core distribution is based upon Apache Hadoop projects, but surrounding that technology is a very distinctive set of highly performant offerings. Much of what MapR does is unknown or misunderstood by the Hadoop community at large; it is worth taking the time to understand MapR in detail.

What’s right for your organization?

The most open source and cutting-edge offering may not be right for your organization. In fact, most organizations don't choose a solution based upon it being open source. The ones that do almost always have a policy demanding that the solution include commercial support. The development pressure from those on the cutting edge of Hadoop at huge West Coast web properties drives the direction of Big Data projects in many cases. This doesn't always translate to the needs of Midwest and East Coast businesses with smaller pools of Hadoop talent in their local markets. Use cases like ETL offload, when attempted in isolation, often prove to be a bridge too far. Adopting a platform that matches your organization's technical depth is a fine balance. There are so many offerings in the market, all promising speed, lower cost and game-changing benefits, that it is difficult to maintain a global view. The answer to what is right for you takes an honest assessment of organizational technical skill, a healthy knowledge of business requirements and, most importantly, a focused use case to start your big data journey. When you are considering adopting a new technology, the reality is that ease of use usually trumps openness or flexibility.

About the Author

Adam Diaz

Big data & Hadoop thought-leader


