Does Hadoop support transactions? This is one of the most common questions I hear from folks who are new to Hadoop and searching for the best technology for their specific use case. Folks from the RDBMS world tend to start by mapping what they know from years of experience onto similar functionality in Hadoop, and transactional data is obviously a common use case. Unfortunately, many still have not heard of the improvements made in Hive since the early days of the Hadoop wars or, more specifically, the Hive/Impala wars. Of course, Spark has also burst onto the scene, but many folks still use Hive as the go-to technology, especially where integration with their favorite BI tool is involved.
Until recently, Hive was not able to handle transactions. At or around Hive version 0.13, functionality was added to support ACID transactions. The feature is disabled by default, but Hive users can now turn on the Hive transaction manager to enable transactional capabilities in Hive. This is particularly useful as a storage layer for streaming use cases; before, transactional HBase was really the go-to technology. Since HDFS doesn’t allow in-place updates, the new transactional functionality works from a base table plus a set of change (delta) files stored in separate directories. Much like the NameNode FSImage and edit log that the Secondary NameNode checkpoints, the base table and change files are merged (compacted) to form a new base. Compaction can be triggered manually or automatically; for example, compaction behavior can be set as a property of the table when it is defined. In short, this overcomes a long-standing difference between Hive and RDBMS features that came from the RDBMS running on a read-write file system: it gives Hive the ability to ‘update’ on a write-once, read-many file system. This probably would have been a much easier change on MapR-FS, which is a truly read-write distributed file system.
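To make the base-plus-change-files idea concrete, here is a minimal toy model in Python. It is only a sketch of the concept, not Hive's implementation: the function names (`apply_delta`, `read_table`, `compact`) and the `(operation, key, value)` record shape are hypothetical, and real Hive stores these as files in directories rather than in-memory dictionaries.

```python
# Toy model of a base table plus change files (illustrative only).
# All names and record shapes here are assumptions, not Hive APIs.

def apply_delta(rows, delta):
    """Apply one change file, a list of (op, key, value) tuples, to rows."""
    for op, key, value in delta:
        if op == "delete":
            rows.pop(key, None)
        else:
            # "insert" and "update" both record the latest value for the key
            rows[key] = value
    return rows


def read_table(base, deltas):
    """A reader sees the base merged with every change file, in write order."""
    rows = dict(base)  # copy, so the base itself is never modified
    for delta in deltas:
        apply_delta(rows, delta)
    return rows


def compact(base, deltas):
    """Fold all change files into a new base and return it with no deltas left."""
    return read_table(base, deltas), []


base = {1: "alice", 2: "bob"}
deltas = [
    [("update", 2, "robert"), ("insert", 3, "carol")],
    [("delete", 1, None)],
]

view_before = read_table(base, deltas)
new_base, remaining = compact(base, deltas)
```

The point of the sketch is that the original base is never rewritten in place: readers merge the base with the change files at read time, and compaction simply produces a new base so that fewer files need to be merged on later reads.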
There are some well-known limitations for Hive transactions, such as the statements available and the file formats supported; these are the restrictions implementers cite most often. In the broader conversation about Hive SQL compliance, transactions are one topic among many. It’s common for those looking to move from an RDBMS, or at a minimum to augment their enterprise data warehousing architecture with Hadoop, to join this conversation.
One of the coolest new capabilities is the HCatalog Streaming Mutation API, which upstaged the Hive HCatalog Streaming API as of Hive 2.0 (see HIVE-10165), specifically for streaming use cases with Hive transactions. This API handles the basics of interfacing Hive with streaming technology and also includes important features like dynamic partition creation to make your life easier.
There are undoubtedly many other improvements to Hive over the course of the versions mentioned above. Having heard the misstatement that “Hive does not support transactions” on so many occasions, it seemed important to write a quick summary that pulls together the latest information on this feature. Explore more about Hive and the Hadoop Ecosystem.
About the Author: Adam Diaz