HCatalog, also called HCat, is an interesting Apache project. It has the distinction of being one of the few Apache projects that were once part of another project, became their own project, and then returned to the original project, Apache Hive.
HCat itself is described in the documentation as “a table and storage management layer” for Hadoop. In short, HCat provides an abstraction layer for accessing data in Hive from a variety of programming languages. It exposes data registered in the Hive metastore to languages other than HQL. Classically, this has included Pig and MapReduce. When Spark burst onto the big data scene, it could access HCat as well.
Using HCat means leveraging an abstraction layer that lets programmers focus on the task at hand, not file format issues. This is done using what is called a “SerDe”, or serializer/deserializer, which translates a programming object into a series of bytes and back again. For those of you who are not Java programmers, a SerDe is a piece of Java code that tells HCat and Hive how to exchange information in a particular format.
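To make the serializer/deserializer idea concrete, here is a minimal sketch in plain Java. It is not Hive's SerDe interface: the YearLabelRow and ToySerDe names and the byte layout are assumptions made up for illustration. It only shows the round trip from an object to bytes and back.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical row type, for illustration only.
final class YearLabelRow {
    final int year;
    final String label;

    YearLabelRow(int year, String label) {
        this.year = year;
        this.label = label;
    }
}

// Toy codec (not Hive's SerDe API): int year, int label length, UTF-8 label bytes.
final class ToySerDe {
    // Serialize: turn the object into a series of bytes.
    static byte[] serialize(YearLabelRow row) {
        byte[] label = row.label.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(2 * Integer.BYTES + label.length);
        buffer.putInt(row.year);
        buffer.putInt(label.length);
        buffer.put(label);
        return buffer.array();
    }

    // Deserialize: rebuild the object by reading the fields in the same order.
    static YearLabelRow deserialize(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int year = buffer.getInt();
        byte[] label = new byte[buffer.getInt()];
        buffer.get(label);
        return new YearLabelRow(year, new String(label, StandardCharsets.UTF_8));
    }
}
```

A real SerDe does the same job for a whole file format, which is why code that reads a table through HCat never has to parse the underlying files itself.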
Getting Started with HCatalog
In general, to use HCatalog you upload data to the distributed file system, define the data in Hive, and then access it from the technology of your choice using the appropriate HCatalog statement for that language.
Accessing Data with HCatalog
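Below are short examples of HCatalog being used with each technology:
Pig - Pig uses HCatLoader to read Hive tables and HCatStorer to write them. Please see the very detailed Hortonworks tutorial on the use of HCat for full worked examples. Once relations have been loaded, they behave like any other Pig relation, for example in a join: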
c = join b by colname1, a by colname1;
Hive - Hive uses HCat directly, so there is no need for special code. Simply define your table as you would in the Hive CLI and it will be accessible via HCat.
MapReduce - MapReduce can also access data via HCat. A fully worked example is available here. In short, adjust your mapper, reducer, and driver to use HCat.
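// When reading: look up the table schema, then fetch a column from the HCatRecord by name.
HCatSchema schema = HCatBaseInputFormat.getTableSchema(context);
Integer var1 = Integer.valueOf(value.getString("var1", schema));
// When writing: describe each output column as an HCatFieldSchema.
List<HCatFieldSchema> columns = new ArrayList<>(3);
columns.add(new HCatFieldSchema("year", HCatFieldSchema.Type.INT, ""));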
SparkSQL - Spark can leverage several languages, including Scala, Python, Java, and R. Of course, one can simply use Spark SQL to run native HQL commands (which natively interact with HCat).
Spark - What if you would like to access Hive from Spark without Spark SQL? Spark code accesses the Hive metastore directly.
Interacting with HCatalog through WebHCat
Simply put, WebHCat is the REST API for HCatalog. This allows for all sorts of scenarios where interacting with HCatalog might be required but cannot be done using other methods. The easiest way to demonstrate WebHCat is via curl. WebHCat URLs include the name “Templeton”, which is the old name for WebHCat.
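A curl call to WebHCat is just an HTTP GET against a Templeton URL, so the same request can be sketched in plain Java. The WebHCatSketch class and its method names are made up for illustration. The host, the user, and the default WebHCat port of 50111 are assumptions about your cluster.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, for illustration only; host, port, and user are assumptions.
final class WebHCatSketch {
    // URL of the WebHCat status check.
    static URI statusUri(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/templeton/v1/status");
    }

    // URL that lists the tables of one Hive database on behalf of the given user.
    static URI listTablesUri(String host, int port, String database, String user) {
        return URI.create("http://" + host + ":" + port
                + "/templeton/v1/ddl/database/" + database + "/table?user.name=" + user);
    }

    // Issues the GET and returns the HTTP status code; needs a reachable WebHCat server.
    static int statusCode(URI uri) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) uri.toURL().openConnection();
        connection.setRequestMethod("GET");
        try {
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }
}
```

For example, WebHCatSketch.listTablesUri("localhost", 50111, "default", "hive") produces the same URL you would hand to curl, and the response body is JSON.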
Connecting to Hive with HiveServer2
HiveServer2 (HS2) is a connection layer that allows client connections to Hive. It includes a TCP- or HTTP-based Hive service layer and, like most Hadoop services, a web interface. One of the easiest ways to connect is to use the built-in client called beeline that comes with Hive. HS2 is the technology that allows many BI tools in the Hadoop market to make use of Hive today.
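Beeline and BI tools reach HS2 through a JDBC URL, so a short Java sketch can show what a connection looks like. The HiveServer2Sketch class and its method names are made up for illustration. The host, database, and user are assumptions, as are the usual default ports (10000 for TCP, 10001 for HTTP). The printTables method also assumes the Hive JDBC driver is on the classpath and HS2 is running.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, for illustration only.
final class HiveServer2Sketch {
    // JDBC URL for the TCP (binary) transport.
    static String binaryUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // JDBC URL for the HTTP transport; httpPath is typically "cliservice".
    static String httpUrl(String host, int port, String database, String httpPath) {
        return binaryUrl(host, port, database) + ";transportMode=http;httpPath=" + httpPath;
    }

    // Connects and prints table names; requires the Hive JDBC driver and a live HS2.
    static void printTables(String url, String user) throws SQLException {
        try (Connection connection = DriverManager.getConnection(url, user, "");
             Statement statement = connection.createStatement();
             ResultSet tables = statement.executeQuery("SHOW TABLES")) {
            while (tables.next()) {
                System.out.println(tables.getString(1));
            }
        }
    }
}
```

The same URL works from the command line, for example beeline -u jdbc:hive2://localhost:10000/default.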
About the Author
Adam Diaz, big data & Hadoop thought leader