Hadoop and Transactions are Not What You Think

November 29, 2016 Adam Diaz

Does Hadoop have the ability to support transactions? This is one of the most common questions I hear from folks new to Hadoop who are searching for the best technology for their specific use case. Folks from the RDBMS world tend to start by mapping what they know from years of experience onto similar functionality in Hadoop, and data arriving via transactions is obviously a common use case. Unfortunately, many still have not heard of the improvements made in Hive since the early days of the Hadoop wars, or more specifically the Hive/Impala wars. Of course, Spark has also burst onto the scene, but to a large degree many folks still use Hive as the go-to technology, especially where integration with their favorite BI tool is involved.

Until recently, Hive was not able to handle transactions. Around Hive version 0.13, functionality was added to support ACID transactions. Disabled by default, the Hive transaction manager can now be turned on to enable transactional capabilities in Hive. This is particularly useful for enabling a storage layer for streaming use cases, where HBase was really the go-to technology before transactions arrived. Since HDFS doesn't allow for updates, the new transactional functionality operates off of a system of base tables plus sets of change (delta) files stored in alternate directories. Much like the Secondary NameNode merging edits into the NameNode's fsimage, the base table and change files are merged (compacted) to form a new base. This operation can run automatically or be triggered manually; compaction behavior can also be set as a property of the table when it is defined. In short, this overcomes a long-standing gap between Hive and RDBMS features rooted in the underlying file system: it gives Hive the ability to 'update' data on a write-once, read-many file system. This probably would have been a much easier change on MapR-FS, which is a truly read-write distributed file system.
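As a minimal sketch of what this looks like in practice, the snippet below turns on the transaction manager for a session and creates a compaction-aware ACID table over Hive's JDBC driver. The HiveServer2 endpoint, database, and clicks table are hypothetical, and the SET commands can just as easily live in hive-site.xml:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAcidSetup {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and database.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Turn on the DbTxnManager -- transactions are off by default.
            stmt.execute("SET hive.support.concurrency=true");
            stmt.execute("SET hive.txn.manager="
                + "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");

            // ACID tables must be bucketed and stored as ORC. The
            // 'transactional' property marks the table for ACID use, and
            // compaction behavior can be set as a table property here too.
            stmt.execute("CREATE TABLE clicks (id INT, url STRING) "
                + "CLUSTERED BY (id) INTO 4 BUCKETS "
                + "STORED AS ORC "
                + "TBLPROPERTIES ('transactional'='true', "
                + "'NO_AUTO_COMPACTION'='false')");

            // Base plus delta files are normally merged automatically,
            // but a compaction can also be requested by hand.
            stmt.execute("ALTER TABLE clicks COMPACT 'major'");
        }
    }
}
```

Note that the compactor processes themselves (hive.compactor.initiator.on and hive.compactor.worker.threads) are enabled on the metastore side rather than per session.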

There are some well-known limitations to Hive transactions, such as the statement types available (no BEGIN, COMMIT, or ROLLBACK; every statement auto-commits) and the file formats supported (currently only ORC). These limitations come up often in the broader conversation about Hive's SQL compliance, a conversation commonly joined by those looking to move off an RDBMS, or at a minimum to augment their enterprise data warehouse architecture with Hadoop. File formats and statement types are the points implementers cite most.
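To make those limitations concrete, here is a short sketch of the DML that does and does not work against the hypothetical clicks table from above; note the single-statement, auto-commit model:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAcidDml {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // INSERT, UPDATE, and DELETE all work on an ACID (ORC) table...
            stmt.execute("INSERT INTO clicks VALUES (1, 'http://example.com')");
            stmt.execute("UPDATE clicks SET url = 'http://example.org' WHERE id = 1");
            stmt.execute("DELETE FROM clicks WHERE id = 1");

            // ...but each statement commits on its own. There is no
            // BEGIN/COMMIT/ROLLBACK, no multi-statement transaction, only
            // the ORC format, and the bucketing column cannot be updated.
        }
    }
}
```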

One of the coolest new capabilities is the HCatalog Streaming Mutation API, which upstaged the older HCatalog Streaming API as of Hive 2.0 (see HIVE-10165) for streaming use cases that need full Hive transactions, including updates and deletes. This API handles the basics of interfacing Hive with streaming technology, but it also includes important features like dynamic partition creation to make your life easier.
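The overall flow looks roughly like the sketch below, adapted from the Mutation API's documented usage pattern. The metastore URI, table, record object, and MyMutatorFactory are all hypothetical stand-ins, and the exact class and method signatures should be verified against the Hive 2.0 javadoc:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hive.hcatalog.streaming.mutate.client.AcidTable;
import org.apache.hive.hcatalog.streaming.mutate.client.MutatorClient;
import org.apache.hive.hcatalog.streaming.mutate.client.MutatorClientBuilder;
import org.apache.hive.hcatalog.streaming.mutate.client.Transaction;
import org.apache.hive.hcatalog.streaming.mutate.worker.MutatorCoordinator;
import org.apache.hive.hcatalog.streaming.mutate.worker.MutatorCoordinatorBuilder;

public class StreamingMutationSketch {
    public static void main(String[] args) throws Exception {
        String metastore = "thrift://metastore-host:9083"; // hypothetical URI

        // The client manages transactions; passing true asks the API to
        // create partitions dynamically as mutations arrive.
        MutatorClient client = new MutatorClientBuilder()
            .metaStoreUri(metastore)
            .addSinkTable("default", "clicks", true)
            .build();
        client.connect();

        List<AcidTable> tables = client.getTables();
        Transaction txn = client.newTransaction();
        txn.begin();

        // A coordinator applies the mutations for one table. The
        // MutatorFactory (MyMutatorFactory -- hypothetical) is user-supplied
        // code that tells the API how to pull row ids out of your records.
        MutatorCoordinator coordinator = new MutatorCoordinatorBuilder()
            .metaStoreUri(metastore)
            .table(tables.get(0))
            .mutatorFactory(new MyMutatorFactory())
            .build();

        List<String> partition = Arrays.asList("2016-11-29"); // hypothetical
        Object record = new Object(); // stand-in for your own record class

        coordinator.insert(partition, record); // create a row
        coordinator.update(partition, record); // mutate an existing row
        coordinator.delete(partition, record); // remove a row
        coordinator.close();

        txn.commit();
        client.close();
    }
}
```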

There are undoubtedly many other improvements to Hive over the course of the versions mentioned above. Having heard the misstatement that "Hive does not support transactions" on so many occasions, it seemed important to write a quick summary aggregating the latest information on this feature. Explore more about Hive and the Hadoop Ecosystem.

About the Author

Adam Diaz

Director of Field Engineering Sales - RTP, Raleigh, NC

