The Best Ways to Get Started with HCatalog

November 16, 2016 Adam Diaz

HCatalog, also called HCat, is an interesting Apache project. It has the unusual distinction of being one of the few Apache projects that were once part of another project, became its own project, and then returned to the original project, Apache Hive.

HCat itself is described in the documentation as “a table and storage management layer” for Hadoop. In short, HCat provides an abstraction layer for accessing data in Hive from a variety of programming languages. It exposes data stored in the Hive metastore to languages other than HQL. Classically, this has included Pig and MapReduce. When Spark burst onto the big data scene, it too gained access to HCat.

Using HCat means leveraging an abstraction layer that lets programmers focus on the task at hand, not file format issues. This is done using what is called a “SerDe,” or serializer/deserializer, which translates a programming object into a series of bytes and back again. For those of you who are not Java programmers, this is a piece of Java code that tells HCat and Hive how to exchange information in a particular format.
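
To make the serializer/deserializer idea concrete, here is a minimal sketch in Python. It only illustrates the concept: real Hive SerDes are Java classes, and the Employee record, its values, and the comma-delimited layout below are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    """Hypothetical in-memory record; not part of Hive or HCatalog."""
    code: str
    description: str
    total_emp: int


def serialize(row: Employee) -> bytes:
    """Turn the object into a comma-delimited line of bytes for storage."""
    line = ",".join([row.code, row.description, str(row.total_emp)])
    return line.encode("utf-8")


def deserialize(raw: bytes) -> Employee:
    """Turn the stored bytes back into an object the program can use."""
    code, description, total_emp = raw.decode("utf-8").split(",")
    return Employee(code, description, int(total_emp))


# Round trip: object -> bytes -> object (illustrative values only)
row = Employee("11-1011", "Chief executives", 299160)
stored = serialize(row)
restored = deserialize(stored)
```

The program only ever works with Employee objects, while the byte layout stays hidden inside the two functions. Swapping the file format would mean swapping those two functions, not the calling code, which is the separation a SerDe gives Hive and HCat.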

Getting Started with HCatalog

In general, you would upload data to the distributed file system, define the data in Hive, and then access the data via a technology of your choice using the appropriate HCatalog statement for that language.
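
As a rough sketch of that workflow, the Python below assembles the command lines for each step without running them. The file name, HDFS directory, table name, and columns are hypothetical placeholders. The sketch assumes the standard hdfs and hcat command-line tools are on the path of the machine where the commands are eventually run.

```python
import shlex
from typing import List

# All names below are hypothetical placeholders for illustration.
LOCAL_FILE = "sample.csv"
HDFS_DIR = "/data/sample"

# Table definition pointing Hive at the uploaded files.
DDL = (
    "CREATE EXTERNAL TABLE IF NOT EXISTS sample (code STRING, total_emp INT) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
    f"LOCATION '{HDFS_DIR}';"
)


def workflow_commands() -> List[List[str]]:
    """Return the command lines for upload, definition, and a quick check."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR],      # create the target directory
        ["hdfs", "dfs", "-put", LOCAL_FILE, HDFS_DIR],  # upload the data
        ["hcat", "-e", DDL],                            # define the table
        ["hcat", "-e", "show tables;"],                 # confirm it is visible
    ]


commands = workflow_commands()
for cmd in commands:
    print(shlex.join(cmd))  # print only; pass to subprocess.run to execute
```

Once the table is defined, the language-specific access methods described in the next section take over.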

Accessing Data with HCatalog

Below are short examples of HCatalog being used with each technology:

Pig - Pig uses HCatLoader and HCatStorer. Please see the very detailed Hortonworks tutorial on use of HCat for full worked examples.

a = LOAD 'TABLENAME1' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'TABLENAME2' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = JOIN b BY colname1, a BY colname1;
DUMP c;

Hive - Hive uses HCat directly so there is no need for special code. Simply define your table as you would in the Hive CLI and it will be accessible via HCat. View a Hive architecture.

MapReduce - MapReduce can also access data via HCat. A fully worked example is available here. In short, adjust your mapper, reducer and driver to use HCat.

// Get the table schema in the mapper
HCatSchema schema = HCatBaseInputFormat.getTableSchema(context);
Integer var1 = Integer.valueOf(value.getString("var1", schema));

// Define the output record schema
List<HCatFieldSchema> columns = new ArrayList<>(3);
columns.add(new HCatFieldSchema("year", HCatFieldSchema.Type.INT, ""));

// Populate the output record
record.setInteger("year", schema, key.getFirstInt());

SparkSQL - Spark can leverage several languages including Scala, Python, Java and R. Of course, one can simply use Spark SQL to run native HQL commands (which interact natively with HCat).

val a = hiveContext.sql("from data.test select country, prodID")

Spark - What if you would like to access Hive from Spark without Spark SQL? Spark code accesses the Hive metastore directly.

Interacting with HCatalog through WebHCat

Simply put, WebHCat is the REST API for HCatalog. This allows for all sorts of scenarios where interacting with HCatalog might be required but cannot be done using other methods. The easiest way to demonstrate WebHCat is via curl. You will notice the name “Templeton” in the URL, which is the old name for WebHCat.

curl -s 'http://localhost:50111/templeton/v1/status'
{"status":"ok","version":"v1"}
 
curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table/sample_07?user.name=hive'
 
{  
  "columns":[  
     {  
        "name":"code",
        "type":"string"
     },
     {  
        "name":"description",
        "type":"string"
     },
     {  
        "name":"total_emp",
        "type":"int"
     },
     {  
        "name":"salary",
        "type":"int"
     }
  ],
  "database":"default",
  "table":"sample_07"
}
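
The same table lookup can be sketched in Python using only the standard library. The host, port, and path mirror the curl calls above. The function names are invented for this example, and describe_table needs a running WebHCat server, so treat this as a starting point rather than a tested client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Same endpoint as the curl examples; adjust for your cluster.
BASE_URL = "http://localhost:50111/templeton/v1"


def table_url(database: str, table: str, user: str) -> str:
    """Build the WebHCat URL that describes one table."""
    query = urlencode({"user.name": user})
    return f"{BASE_URL}/ddl/database/{database}/table/{table}?{query}"


def column_types(payload: str) -> dict:
    """Map column name to type from a table-description response."""
    doc = json.loads(payload)
    return {col["name"]: col["type"] for col in doc["columns"]}


def describe_table(database: str, table: str, user: str) -> dict:
    """Perform the HTTP GET; requires a reachable WebHCat server."""
    with urlopen(table_url(database, table, user)) as response:
        return column_types(response.read().decode("utf-8"))
```

Calling describe_table("default", "sample_07", "hive") against the server shown above should yield the same four columns as the curl output, as a plain dictionary.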
 

Connecting to Hive with HiveServer2

HiveServer2 (HS2) is a connection layer that allows client connections to Hive. It includes a TCP- or HTTP-based Hive service layer and, like most Hadoop services, a web interface. One of the easiest ways to connect is to use Beeline, the built-in client that comes with Hive. HS2 is the technology that allows many BI tools in the Hadoop market to make use of Hive today.

beeline
WARNING: Use "yarn jar" to launch YARN applications.
Beeline version 1.2.1000.2.4.0.0-169 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000/default
Connecting to jdbc:hive2://localhost:10000/default
Enter username for jdbc:hive2://localhost:10000/default: hive
Enter password for jdbc:hive2://localhost:10000/default: ****
Connected to: Apache Hive (version 1.2.1000.2.4.0.0-169)
Driver: Hive JDBC (version 1.2.1000.2.4.0.0-169)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/default> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| xademo         |
+----------------+--+
2 rows selected (2.867 seconds)
0: jdbc:hive2://localhost:10000/default>
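
The interactive session above can also be run non-interactively. The Python sketch below only assembles the Beeline command line. It uses Beeline's -u (JDBC URL), -n (username), and -e (query) options. The helper names are made up for this example, and the password is left out on purpose (Beeline accepts one via -p).

```python
import shlex
from typing import List


def jdbc_url(host: str = "localhost", port: int = 10000,
             database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL like the one used in the session above."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(query: str, user: str, url: str) -> List[str]:
    """Assemble a one-shot Beeline invocation that runs a query and exits."""
    return ["beeline", "-u", url, "-n", user, "-e", query]


cmd = beeline_command("show databases;", user="hive", url=jdbc_url())
print(shlex.join(cmd))  # print only; pass cmd to subprocess.run to execute
```

Building the command as a list keeps the query as a single argument, so no manual shell quoting is needed if it is later handed to subprocess.run.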
 
You can think of HS2 as a Thrift-based service allowing remote access to the Hive command line. So again, in this scenario, any tables created will automatically be available via HCatalog, since you are essentially working at the Hive CLI.
 
0: jdbc:hive2://localhost:10000/default> show tables;
+------------+--+
|  tab_name  |
+------------+--+
| sample_07  |
| sample_08  |
+------------+--+
2 rows selected (0.381 seconds)
0: jdbc:hive2://localhost:10000/default> create table testtable( eid int);
No rows affected (10.69 seconds)
0: jdbc:hive2://localhost:10000/default> show tables;
+------------+--+
|  tab_name  |
+------------+--+
| sample_07  |
| sample_08  |
| testtable  |
+------------+--+
3 rows selected (0.351 seconds)
 
hcat -e "show tables;"
WARNING: Use "yarn jar" to launch YARN applications.
OK
sample_07
sample_08
testtable
Time taken: 8.589 seconds
 

Learn more about HCatalog and compatible technologies

HCatalog is a way for many different technologies to share the tables defined in Hive without having to write low-level integration with the Hive Metastore. Without HCatalog, simply reusing existing data becomes more cumbersome. Aside from the fact that Hive is the technology in Hadoop that looks and feels the most like everyone’s beloved RDBMS, HCatalog is what allows a multitude of Hadoop command line tools to interact with Hive. HS2 then provides the easiest way for a sea of BI tools to connect to Hive and leverage tables in Hive directly.
 
For more about the technologies HCatalog is compatible with, check out our article about versioning.
 

About the Author

Adam Diaz

Director of Field Engineering Sales - RTP Raleigh NC
