Performance Tuning an HBase Vault

July 10, 2017 | Garima Dosi

Performance tuning of big data applications usually depends on two factors: the optimization options provided by the technology being used, and the nature of the data being processed. Focusing on the technology is certainly necessary, but the nature of the data should not be overlooked. The key to achieving performance for a big data application is often hidden in the kind of data being processed.

Real world HBase example

We recently tried to use HBase as a vault to store tokenized (encrypted) values for sensitive/PII fields. Storing tokens in and retrieving them from HBase is fairly easy. The challenge comes when the operation has to be performed on huge datasets of billions of rows within a defined SLA.

The story goes like this: there are 60 different kinds of files which are continuously generated and sent. These files are to be ingested into Hadoop, and each file has fields containing sensitive data that must be tokenized (tokens looked up in and stored in HBase). The SLA is to complete the tokenization process within 5-10 minutes to keep up with the continuous inflow of files.

The algorithm used by the MapReduce job to tokenize the sensitive fields and store them in the vault is:

In each mapper:
    For each column to be tokenized:
        Look up the existing token in HBase (one HBase call)
        If the token does not exist in the vault:
            Generate and insert a new token into HBase (another HBase call)
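
To make that concrete, here is a minimal Java sketch of the lookup-then-insert step using the HBase client API. The table handle, column family, qualifier, row key format, and token generator are illustrative assumptions, not the actual implementation:

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    import java.io.IOException;

    /** Hypothetical helper showing the per-column lookup-then-insert step. */
    public class VaultLookup {

        private static final byte[] CF = Bytes.toBytes("t");          // assumed column family
        private static final byte[] QUAL = Bytes.toBytes("token");    // assumed qualifier

        /** Returns the token for a value, inserting a new one if the vault has none. */
        public static byte[] tokenize(Table vault, String column, String value) throws IOException {
            byte[] rowKey = Bytes.toBytes(column + "_key_" + value);  // initial row key layout
            Result result = vault.get(new Get(rowKey));               // one HBase call
            byte[] token = result.getValue(CF, QUAL);
            if (token == null) {                                      // token not in the vault yet
                token = generateToken(value);                         // hypothetical token generator
                Put put = new Put(rowKey);
                put.addColumn(CF, QUAL, token);
                vault.put(put);                                       // another HBase call
            }
            return token;
        }

        private static byte[] generateToken(String value) {
            // Stand-in for the real tokenization logic.
            return Bytes.toBytes(java.util.UUID.randomUUID().toString());
        }
    }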

The algorithm is simple and worked fine when processing records of one file type at a time, or when there was little other load on the cluster. But with simultaneous processing of many files, the tokenization jobs would take hours to complete. The usual code optimizations recommended for HBase were already in place (a sketch of the batched calls follows the list):

  • Batch Lookups & Batch Inserts
  • Region Server - Memory & GC settings
  • Region Server - Read and Write Handler Count settings
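
For reference, the batched calls look roughly like the sketch below. It reuses the same hypothetical vault table as above, and the handler-count tuning mentioned in the list corresponds to region server settings such as hbase.regionserver.handler.count in hbase-site.xml:

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical sketch of batched vault lookups and inserts. */
    public class BatchedVaultCalls {

        private static final byte[] CF = Bytes.toBytes("t");
        private static final byte[] QUAL = Bytes.toBytes("token");

        /** Looks up a batch of row keys in one round trip and returns the keys with no token yet. */
        public static List<byte[]> lookupBatch(Table vault, List<byte[]> rowKeys) throws IOException {
            List<Get> gets = new ArrayList<>();
            for (byte[] rowKey : rowKeys) {
                gets.add(new Get(rowKey));
            }
            Result[] results = vault.get(gets);                       // one batched RPC instead of N calls
            List<byte[]> missing = new ArrayList<>();
            for (int i = 0; i < results.length; i++) {
                if (results[i].getValue(CF, QUAL) == null) {
                    missing.add(rowKeys.get(i));
                }
            }
            return missing;
        }

        /** Writes newly generated tokens for the missing keys in one batched put. */
        public static void insertBatch(Table vault, List<byte[]> rowKeys, List<byte[]> tokens) throws IOException {
            List<Put> puts = new ArrayList<>();
            for (int i = 0; i < rowKeys.size(); i++) {
                Put put = new Put(rowKeys.get(i));
                put.addColumn(CF, QUAL, tokens.get(i));
                puts.add(put);
            }
            vault.put(puts);                                          // one batched write
        }
    }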

However, the behavior of the jobs remained the same. The observation was that, no matter what we did, the region servers were choking under simultaneous load. So:

  1. We throttled the execution of the jobs by creating a separate YARN queue, with limited resources, reserved for tokenization jobs only.
  2. We revisited the row key design, taking special care that the row keys were well distributed. Since the HBase vault was bi-directional, meaning it supported both tokenization and de-tokenization, the row key had to support both original_value->token and token->original_value lookups. Initially, the row key was structured as <column_name>_<keyword (key/token)>_<original_value/token_value>, for example imsi_key_9876545589 or imsi_token_ABC4589000. Because every key for a given column shared the same low-cardinality prefix, keys clustered on a few regions; to achieve a better distribution, the structure was reversed to <original_value/token_value>_<keyword (key/token)>_<column_name> (see the sketch below).
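
To make the change concrete, the two key layouts compare roughly as follows; the field names and helper methods are illustrative, not the actual code:

    import org.apache.hadoop.hbase.util.Bytes;

    /** Illustrative comparison of the two row key layouts. */
    public class RowKeyLayouts {

        // Old layout: the column name leads, so every key for a given column shares the
        // same low-cardinality prefix and lands in the same region range
        // (e.g. imsi_key_9876545589).
        static byte[] oldKey(String column, String keyword, String value) {
            return Bytes.toBytes(column + "_" + keyword + "_" + value);
        }

        // New layout: the high-cardinality value leads, spreading keys across regions
        // (e.g. 9876545589_key_imsi).
        static byte[] newKey(String column, String keyword, String value) {
            return Bytes.toBytes(value + "_" + keyword + "_" + column);
        }
    }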

This improved the overall performance of the jobs, since the region servers were under less pressure, but the main problem remained: the jobs were still slow. The answer was uncovered only with further analysis of the dataset.

The answer is in the nature of the data

It was observed that the sensitive fields being tokenized were highly repetitive: keeping only distinct values of these fields would shrink the dataset by 60-70%. For example, in telecom probe data a phone number (MSISDN) repeats for every event generated by a subscriber, which means that in a dataset of a billion rows the same MSISDN occurs again and again. While doing lookups to tokenize such a dataset, the requests for all those repeated values are served by one single region server in HBase, slowing down the response for each call and, in turn, the performance of the entire job. The answer was hidden in the nature of the data being processed.
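
A quick way to gauge this kind of repetition is to compare distinct and total counts for the sensitive column. The sketch below uses Spark's Java API with an assumed input path, format, and column name:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    /** Hypothetical check of how repetitive a sensitive field is before deduplicating it. */
    public class RepetitionCheck {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("repetition-check").getOrCreate();
            Dataset<Row> events = spark.read().parquet("/data/probe/events");   // assumed path and format
            long total = events.count();
            long distinct = events.select("msisdn").distinct().count();         // assumed column name
            // A ratio of roughly 0.3-0.4 corresponds to the 60-70% reduction described above.
            System.out.printf("distinct/total = %.2f%n", (double) distinct / total);
            spark.stop();
        }
    }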

The problem was then solved using the following steps (a sketch follows the list):

  • Take the distinct values of the sensitive/PII fields
  • Tokenize the distinct dataset (batch lookups and writes) and fetch the token for each value from the HBase vault
  • Join the tokenized dataset back with the original dataset
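
As an illustration of the pattern, here is a rough sketch using Spark's Java API; the input path, record parsing, and token lookup are stand-ins, and the same flow can equally be expressed as MapReduce or Hive jobs:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import scala.Tuple2;

    /** Hypothetical sketch of the distinct -> tokenize -> join pattern. */
    public class DistinctTokenizeJoin {

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("tokenize-sketch").getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            // 1. Original dataset, keyed by the sensitive value (one sensitive column shown).
            JavaPairRDD<String, String> records = jsc.textFile("/data/in/events")      // assumed path
                    .mapToPair(line -> new Tuple2<>(extractMsisdn(line), line));

            // 2. Distinct sensitive values only -- 60-70% smaller than the raw dataset.
            JavaRDD<String> distinctValues = records.keys().distinct();

            // 3. Tokenize just the distinct values; in practice this would batch the HBase
            //    lookups and inserts per partition rather than call the vault per value.
            JavaPairRDD<String, String> valueToToken = distinctValues
                    .mapToPair(value -> new Tuple2<>(value, lookupOrCreateToken(value)));

            // 4. Join the small value->token map back onto the original records.
            JavaPairRDD<String, Tuple2<String, String>> tokenized = records.join(valueToToken);

            tokenized.saveAsTextFile("/data/out/events-tokenized");                    // assumed path
            spark.stop();
        }

        private static String extractMsisdn(String line) {
            // Stand-in for the real record parsing.
            return line.split(",")[0];
        }

        private static String lookupOrCreateToken(String value) {
            // Stand-in for the HBase vault lookup/insert.
            return "TOK_" + value;
        }
    }

The key point is that the expensive HBase traffic becomes proportional to the number of distinct sensitive values rather than the total number of rows.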

And done! This simple mechanism greatly improved job performance, and we were able to meet the SLA of completing the tokenization process within 5-10 minutes for all 60 datasets.

About the Author

Garima Dosi

Solution Architect, Engineering - Guwahati, India
