Pig vs. Hive: Is There a Fight?

October 5, 2016 Manoj Gogoi

Pig and Hive came into existence out of the necessity for enterprises to interact with huge amounts of data without worrying much about writing complex MapReduce code. Though they were born out of necessity, they have come a long way and can now run even on top of other Big Data processing engines like Spark. Both components of the Hadoop ecosystem provide a layer of abstraction over these core execution engines. Hive was invented to give people something that looked like SQL and would ease the transition from RDBMSs; Pig takes a more procedural approach and was created so people didn’t have to write MapReduce in order to manipulate data.

When to Harvest Benefits from Hive

Apache Hive is a terrific Big Data component when it comes to data summarization and extraction. It is undoubtedly an ideal tool for working on data that already has a schema associated with it. In addition, the Hive metastore facilitates partitioning of the data based on user-specified conditions, which makes data retrieval faster. However, one should be careful about using an excessive number of partitions in a single query, because that can lead to any of the following issues:

  • An increase in the number of partitions in a query means that the number of paths associated with them also increases. Say a use case has to run a query over a table with 10,000 top-level partitions, each comprising further nested partitions. While translating the query into a MapReduce job, Hive sets the paths of all the partitions in the job configuration, so the number of partitions directly impacts the size of the job. Since the default jobconf size is 5 MB, exceeding that limit incurs a runtime execution failure stating something like: "java.io.IOException: Exceeded max jobconf size: 14762369 limit: 5242880".
  • Bulk registration of partitions (for example, 10,000 × 100,000 partitions) via "MSCK REPAIR TABLE tablename" also has its restrictions, owing to the Hadoop heap size and the GC overhead limit. Crossing those limits can lead to erroneous results or bring down the entire execution with a StackOverflowError like the one below:
Exception in thread "main" java.lang.StackOverflowError
      at org.datanucleus.query.expression.ExpressionCompiler.isOperator(ExpressionCompiler.java:819)
      at org.datanucleus.query.expression.ExpressionCompiler.compileOrAndExpression(ExpressionCompiler.java:190)
      at org.datanucleus.query.expression.ExpressionCompiler.compileExpression(ExpressionCompiler.java:179)
      at org.datanucleus.query.expression.ExpressionCompiler.compileOrAndExpression(ExpressionCompiler.java:192)
      at org.datanucleus.query.expression.ExpressionCompiler.compileExpression(ExpressionCompiler.java:179)
  • Extensively complex multi-level operations, such as joins over numerous partitions, have their limits as well. Big queries might fail when the Hive compiler performs semantic validation against the metastore. Because the Hive metastore is backed by a relational database, large queries can fail with an exception like 'com.mysql.jdbc.PacketTooBigException: Packet for query is too large'.
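The jobconf pressure described in the first bullet can be sketched with back-of-the-envelope arithmetic. The 5 MB figure is the default limit quoted above; the partition counts and path lengths below are illustrative assumptions, not measurements:

```python
# Rough estimate of how partition paths inflate the job configuration.
# The numbers (nesting depth, path length) are assumptions for illustration.

def estimated_paths_size(num_top_partitions, nested_per_partition, avg_path_bytes):
    """Total bytes needed to list every leaf partition path in the jobconf."""
    return num_top_partitions * nested_per_partition * avg_path_bytes

JOBCONF_LIMIT = 5 * 1024 * 1024  # 5242880 bytes, the default jobconf size

# 10,000 top-level partitions, each with 20 nested partitions, and paths
# like '/warehouse/db/table/date=20100819/country=US' (~80 bytes each).
size = estimated_paths_size(10_000, 20, 80)
print(size, size > JOBCONF_LIMIT)  # the path list alone blows past the limit
```

Even with modest nesting, the serialized path list dwarfs the default limit, which is why the failure appears only once the partition count grows.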

The above properties, such as the jobconf size, the Hadoop heap size, and the packet size, are undoubtedly configurable. To avoid these issues, however, put the emphasis on a better semantic design rather than on frequently changing the configuration.

The optimum benefit of Hive is derived from a systematic schema design over the data residing in HDFS. This may mean an approach that uses a moderate number of partitions, each holding a large chunk of data, rather than an excessive number of partitions with little data in each. After all, partitioning is meant to make querying specific data faster by eliminating the need to operate on the entire dataset. Reducing the number of partitions keeps the load on the metastore minimal and the resource utilization of the cluster high.
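To make the granularity trade-off concrete, here is a small sketch comparing a coarse and a fine partitioning scheme over the same dataset. The dataset size and scheme sizes are hypothetical, chosen only to illustrate the point:

```python
# Illustrative comparison of partition granularity (assumed numbers):
# the same 1 TB of data partitioned coarsely vs. finely.

TOTAL_BYTES = 1 * 1024**4  # 1 TB, an assumed dataset size

def partition_layout(num_partitions):
    """Return (partition count, average bytes per partition)."""
    return num_partitions, TOTAL_BYTES // num_partitions

# Coarse: one partition per day for a year -> few partitions, ~3 GB each.
coarse = partition_layout(365)

# Fine: one partition per day per 200 countries -> 73,000 tiny partitions.
fine = partition_layout(365 * 200)

print(coarse)
print(fine)
```

The coarse layout keeps each partition large enough to be worth a scan, while the fine layout multiplies metastore entries and jobconf paths for only ~15 MB of data per partition.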

When to Make the Pig Grunt

Apache Pig has a huge appetite and can consume all sorts of data, whether it is structured, semi-structured, or unstructured. Unlike Hive, it has no metastore of its own, but it can leverage Hive’s HCatalog. In fact, Pig was created to carry out complex, extensible operations on large datasets and to perform self-optimizations on the fly. Even though a Pig script reads as a sequence of steps, internally multiple operations are optimized at execution time, which reduces the number of data scans.

Let’s revisit the 10,000-partition scenario from the Hive example, this time using Pig on the same dataset. Since Pig has no metastore of its own, the concept of partitioning doesn’t apply to Pig alone. To make use of HCatalog, the Pig script can be written as follows (Pig 0.15 has been used):

/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- date is a partition column; age is not
B = filter A by date == '20100819' and age < 30;

-- both date and country are partition columns
C = filter A by date == '20100819' and country == 'US';

But if numerous partitions exist, querying all of them in a single request through HCatalog in Pig might run into the same issues described for Hive. In that case it is more convenient to use globs and wildcards instead.

For example: 

Partition-1, Partition-2, Partition-3,....Partition-n exist within the location /user/inputLocation/

Using globs we can provide the input to Pig as:

/user/inputLocation/{Partition-1, Partition-2, Partition-3,....Partition-n}

And with wildcards it would be:

/user/inputLocation/Partition-*
And in case of nested partitions, we can have a combination of globs and wildcards, such as:

/user/inputLocation/{Partition-1,Partition-2, Partition-3,....Partition-n}/*
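As a side note, the brace-glob strings above can be generated programmatically when the partition list is long. The helper below is hypothetical (not part of Pig or HCatalog), shown only to illustrate how such a path is assembled:

```python
# Hypothetical helper that builds a Hadoop-style {a,b,c} glob path
# from a base location and a list of partition directory names.

def build_glob(base, partitions, nested_wildcard=False):
    """Join partition names into a brace glob under base; optionally
    append '/*' to descend into nested partitions."""
    glob = base.rstrip('/') + '/{' + ','.join(partitions) + '}'
    if nested_wildcard:
        glob += '/*'  # match one level of nested partitions
    return glob

parts = ['Partition-1', 'Partition-2', 'Partition-3']
print(build_glob('/user/inputLocation/', parts))
# /user/inputLocation/{Partition-1,Partition-2,Partition-3}
print(build_glob('/user/inputLocation/', parts, nested_wildcard=True))
# /user/inputLocation/{Partition-1,Partition-2,Partition-3}/*
```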

Pig will happily read the data from those locations and perform the operations according to its optimized execution plan. The only hindrance for Pig in this case could be resource unavailability on the cluster. Also, in scenarios where numerous transformations will be made on the data, Apache Pig is arguably the tool that stands tall.

Operating Between Hive and Pig

The following information will provide a glimpse into the world of Hive and Pig and how they operate.



It is well known that Apache Hive is primarily a data warehouse platform for interacting with huge sets of structured data residing either in HDFS or in the HBase store. The Hive Query Language used here is very similar to SQL and integrates quite well with Hadoop. Unlike Pig, the query model is purely declarative in nature, which makes Hive ideal for data scientists engaged in data presentation and analysis.

When it comes to interacting with Hive, users can connect directly via the Hive command-line interface or through integration with HiveServer. Any submitted query is first taken up by the driver and then handed to the compiler, which validates the query both syntactically and semantically. The Hive metastore, a storehouse of the schemas/mappings for all the data associated with Hive, plays an important role in assisting the compiler with the semantic verification of the query. The driver runs optimizations on top of the semantics, prepares the execution plan, and submits it to the HQL processing engine, which in turn generates the equivalent programmatic representation of the query for the chosen execution engine (MapReduce, Spark, etc.). Any successful change to the schema is written back to the metastore via the HQL processing engine.



Apache Pig provides a high-level language platform for operating on and analyzing enormous datasets, whether the data is structured or unstructured. The language, termed Pig Latin, takes the form of a script that can be executed directly in the Pig shell or triggered via the Pig Server. The user-written script is first parsed by the Pig Latin processing engine for syntactic validity and transformed into a Directed Acyclic Graph (DAG) comprising the initial logical plan of the entire execution. The processing engine then accepts the DAG and internally carries out optimizations on the plan, facilitated by Pig’s procedural approach and its use of lazy evaluation during execution.

To understand the behavior of the optimizer, consider a scenario where a user writes a script that joins two datasets and then applies a single filter criterion. Pig’s optimizer checks whether the filter operation can be moved ahead of the join so that the load on the join is minimized for better efficiency, and it designs the logical plan accordingly. This allows the user to remain focused on the end result rather than worry about performance.
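The filter-pushdown idea can be illustrated with a toy example in plain Python (this is an analogy, not Pig internals): filtering first shrinks the relation the join must scan, while the final result is identical.

```python
# Toy relations: (name, age) and (name, site). Data is invented for illustration.
users = [('alice', 25), ('bob', 35), ('carol', 28)]
visits = [('alice', 'site1'), ('bob', 'site2'), ('carol', 'site3')]

def join(left, right):
    """Naive nested-loop equi-join on the first field."""
    return [(n, a, s) for (n, a) in left for (m, s) in right if n == m]

# Naive plan: join everything, then filter on age.
naive = [row for row in join(users, visits) if row[1] < 30]

# Pushed-down plan: filter users first, then join the smaller relation.
young = [(n, a) for (n, a) in users if a < 30]
optimized = join(young, visits)

print(naive == optimized)  # same result; the join scanned fewer rows
```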

Compilation comes into action only after the fully optimized logical plan is ready. It is responsible for generating the physical plan for the assigned execution engine, which eventually interacts with the data residing in HDFS.



Pig and Hive have undoubtedly become an integral part of the Big Data world. Both provide the flexibility and extensibility to incorporate custom-made functionality, and each has its own staunch identity and behavior. Which tool works best depends on the project’s requirements.
