Partitioning in Hive

September 20, 2017 Kuwali Sarma

The concept of partitioning in Hive is very similar to what we have in RDBMS. A table can be partitioned by one or more keys. This will determine how the data will be stored in the table. For example, if a table has two columns, id, name and age; and is partitioned by age, all the rows having same age will be stored together.

Partitioned tables

Let us consider a non-partitioned managed table as follows:

 create table salesdata_source
(salesperson_id int,
product_id int,
date_of_sale string)
salesperson_id product_id date_of_sale
12 101 10-27-2017
10 10010 10-27-2017
111 2010 10-27-2017
13 222 10-28-2017
15 235 10-28-2017

We have another table with the same schema, but partitioned on the column ‘date_of_sale.’

create table salesdata
(salesperson_id int,
product_id int)
partitioned by (date_of_sale string)


-----Please note that the partition column need not be mentioned in the table schema separately.

Now, let us insert data from salesdata_source to salesdata. There are two ways of inserting the data:

Static partitioning

We specify that value of the partition while inserting the data.

insert into table salesdata partition (date_of_sale=’10-27-2017’)
select * from salesdata_source where date_of_sale=’10-27-2017’;
insert into table salesdata partition (date_of_sale=’10-28-2017’)
select * from salesdata_source where date_of_sale=’10-28-2017’;


This will create two partitions in salesdata and will insert all the data from salesdata_source to salesdata. We can verify that by the following command:

show partitions salesdata;
O/p:
date_of_sale=’10-27-2017’
date_of_sale=’10-28-2017’


The table will have the same data as the source table:

salesperson_id product_id date_of_sale
12 101 10-27-2017
10 10010 10-27-2017
111 2010 10-27-2017
13 222 10-28-2017
15 235 10-28-2017

But internally, the data for each partition will be stored as separate files under separate subdirectories. We can see the physical storage by going to the HDFS location of the table.
 

hadoop fs -ls /apps/hive/warehouse/db_name.db/salesdata/
O/p:
/apps/hive/warehouse/db_name.db/salesdata/date_of_sale=10-27-2017
/apps/hive/warehouse/db_name.db/salesdata/date_of_sale=10-28-2017


So as you can see, there are sub-directories created for each partition, and the data files for each partition are stored under that.

Dynamic partitioning

There is another way of partitioning where we let the Hive engine dynamically determine the partitions based on the values of the partition column.

Before using this, we have to set a property that allows dynamic partition:
 

set hive.exec.dynamic.partition.mode=nonstrict;


(This is because Dynamic Partitioning is disabled in Hive to prevent accidental creation of huge number of partitions)
 

hive> insert into table salesdata partition (date_of_sale)
select salesperson_id,product_id,date_of_sale from salesdata_source ;


-- Please note that the partitioned column should be the last column in the select clause.

In this method, Hive engine will determine the different unique values that the partition columns holds(i.e date_of_sale), and creates partitions for each value.

We can see that with the following command:
 

hive> show partitions salesdata;
date_of_sale=’10-27-2017’
date_of_sale=’10-28-2017’


The maximum number of partitions that can be created by default is 200. We can increase this number by using the following queries:
 

set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;


Why do we need partitions

Let us fire two queries on the above tables:
 

select * from salesdata_source where salesperson_id=12 and date_of_sale =’10-28-2017’


The above query will scan the entire table in search of this particular salesperson_id and date_of_sale. But, if the same query is fired for salesdata (which is partitioned),
 

select * from salesdata where salesperson_id=12 and date_of_sale =’10-28-2017’


only the partition date_of_sale =’10-28-2017’ will be scanned thus reducing the query execution time by a large margin. The table we have used here is very small, the difference in execution time will be better understood in case of huge datasets.

External partitioned tables

So we have seen how to insert data into partitioned tables from other tables. Let’s see how to handle data that is already present in HDFS.

Consider a file /user/hive/text1.dat which has the following data:

salesperson_id|product_id|date_of_sale
12|101|10-27-2017
10|10010|10-27-2017
111|2010|10-27-2017

We see that this file has data corresponding to date_of_sale=’10-27-2017’. A partitioned external table can be used to populate this data.
 

create external table salesdata_ext
(salesperson_id int,
product_id int)
partitioned by (date_of_sale string)
location ‘/user/hive/salesdata_ext/’


Now, we will create a subdirectory under this location manually, and move the file here.
 

hadoop fs -mkdir /user/hive/salesdata_ext/date_of_sale=10-27-2017
hadoop fs -cp /user/hive/text1.dat /user/hive/salesdata_ext/date_of_sale=10-27-2017/


So the file has been placed into the partition folder. Now we can see if the partition information is reflected in the table data:
 

Show partitions salesdata_ext;
--0


For the partition to reflect in the table metadata, we will either have to repair the table or add partition by using the alter command that we are discussing later.
 

msck repair table salesdata_ext;
show partitions salesdata_ext;
O/p:
date_of_sale=10-27-2017


OR
 

alter table salesdata_ext add partition (date_of_sale=’10-27-2017’);
show partitions salesdata_ext;
O/p:
date_of_sale=10-27-2017


Note: Using external partitioned table is a design choice. The file from HDFS can be loaded into a managed non-partitioned table as well, and from that, into a partitioned table as discussed earlier.

Alter partitions

Suppose we have created partitions for a table, but we need to rename a particular partition or drop a partition that got incorrectly created. We can use the ‘alter’ command in such cases:
 

1. alter table salesdata partition (date_of_sale=10-27-2017) rename to partition (date_of_sale=10-27-2018);

Partition name will be changed(can be verified by show partitions), and the subdirectory name in the warehouse will be changed, but the data underneath will remain same.
 

2. alter table salesdata drop partition (date_of_sale=10-27-2017) ; (internal table)

Partition will be dropped and the subdirectory will be deleted.
 

3. alter table salesdata_ext drop partition (date_of_sale=10-27-2017) ; (external table)

Partition will be dropped but the subdirectory will not be deleted since this is an external table.

Conversely, if we delete the subdirectory but do not drop the partition using alter command, the partitions will remain in both external and managed tables, until we don’t execute the alter table drop partition command for the deleted partition.

Check out more of our Hive blogs:

Hive Basics - Elasticsearch Integration                           Tez and LLAP Improvements to Make Hive Faster

About the Author

Kuwali Sarma

Senior Software Engineer

More Content by Kuwali Sarma
Previous Article
An Operationalized Data Lake is Your Best Bet for GDPR Compliance
An Operationalized Data Lake is Your Best Bet for GDPR Compliance

Don't think of a data lake as just a repository of data. When managed with a holistic platform a data lake ...

Next Article
Leveraging Big Data Lessons to Optimize Data Lakes for IoT
Leveraging Big Data Lessons to Optimize Data Lakes for IoT

Zaloni has optimized data lakes and data management for the next big data surge – the avalanche of data fro...

×

Subscribe to the latest data lake expertise!

First Name
Last Name
Company
I would like to subscribe to email updates about content and events.
Zaloni is committed to the best experience for you. Read more on our Privacy Policy.
Thank you!
Error - something went wrong!