The concept of partitioning in Hive is very similar to what we have in RDBMS. A table can be partitioned by one or more keys. This will determine how the data will be stored in the table. For example, if a table has two columns, id, name and age; and is partitioned by age, all the rows having same age will be stored together.
Let us consider a non-partitioned managed table as follows:
We have another table with the same schema, but partitioned on the column ‘date_of_sale.’
-----Please note that the partition column need not be mentioned in the table schema separately.
Now, let us insert data from salesdata_source to salesdata. There are two ways of inserting the data:
We specify that value of the partition while inserting the data.
This will create two partitions in salesdata and will insert all the data from salesdata_source to salesdata. We can verify that by the following command:
The table will have the same data as the source table:
But internally, the data for each partition will be stored as separate files under separate subdirectories. We can see the physical storage by going to the HDFS location of the table.
So as you can see, there are sub-directories created for each partition, and the data files for each partition are stored under that.
There is another way of partitioning where we let the Hive engine dynamically determine the partitions based on the values of the partition column.
Before using this, we have to set a property that allows dynamic partition:
(This is because Dynamic Partitioning is disabled in Hive to prevent accidental creation of huge number of partitions)
-- Please note that the partitioned column should be the last column in the select clause.
In this method, Hive engine will determine the different unique values that the partition columns holds(i.e date_of_sale), and creates partitions for each value.
We can see that with the following command:
The maximum number of partitions that can be created by default is 200. We can increase this number by using the following queries:
Why do we need partitions
Let us fire two queries on the above tables:
The above query will scan the entire table in search of this particular salesperson_id and date_of_sale. But, if the same query is fired for salesdata (which is partitioned),
only the partition date_of_sale =’10-28-2017’ will be scanned thus reducing the query execution time by a large margin. The table we have used here is very small, the difference in execution time will be better understood in case of huge datasets.
External partitioned tables
So we have seen how to insert data into partitioned tables from other tables. Let’s see how to handle data that is already present in HDFS.
Consider a file /user/hive/text1.dat which has the following data:
We see that this file has data corresponding to date_of_sale=’10-27-2017’. A partitioned external table can be used to populate this data.
Now, we will create a subdirectory under this location manually, and move the file here.
So the file has been placed into the partition folder. Now we can see if the partition information is reflected in the table data:
For the partition to reflect in the table metadata, we will either have to repair the table or add partition by using the alter command that we are discussing later.
Note: Using external partitioned table is a design choice. The file from HDFS can be loaded into a managed non-partitioned table as well, and from that, into a partitioned table as discussed earlier.
Suppose we have created partitions for a table, but we need to rename a particular partition or drop a partition that got incorrectly created. We can use the ‘alter’ command in such cases:
Partition name will be changed(can be verified by show partitions), and the subdirectory name in the warehouse will be changed, but the data underneath will remain same.
Partition will be dropped and the subdirectory will be deleted.
Partition will be dropped but the subdirectory will not be deleted since this is an external table.
Conversely, if we delete the subdirectory but do not drop the partition using alter command, the partitions will remain in both external and managed tables, until we don’t execute the alter table drop partition command for the deleted partition.
Check out more of our Hive blogs:
About the Author
Senior Software EngineerMore Content by Kuwali Sarma