Performance tuning of applications in big data usually depends on two factors: the optimization options provided by the big data technology being used, and the nature of the data being processed. Focusing on the technology is certainly a requirement, but the nature of the data should not be overlooked. The key to achieving performance in a big data application is often hidden in the kind of data being processed.
A real-world HBase example
We recently tried to use HBase as a vault to store tokenized (encrypted) information for sensitive/PII fields. Storing tokens in and retrieving them from HBase is fairly easy. The challenge comes when performing these operations on huge datasets of billions of rows within a defined SLA.
The story goes like this: 60 different kinds of files are continuously generated and sent. These files are to be ingested into Hadoop; each file has fields containing sensitive data that must be tokenized (tokens looked up from and stored in HBase). The SLA is to complete the tokenization process in 5-10 minutes to keep up with the continuous inflow of files.
The tokenization of sensitive fields was implemented as a MapReduce job. For each record, the sensitive field values are looked up in the HBase vault; any value with no existing token gets one generated and inserted (along with a reverse entry for de-tokenization), and the field values are then replaced with their tokens in the output.
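A minimal sketch of this per-record tokenization step, with the HBase vault simulated by a dict and a placeholder token generator (the names and the token scheme here are illustrative assumptions, not the production implementation):

```python
import hashlib

def make_token(value: str) -> str:
    # Placeholder token generator; the real vault would use a secure tokenization scheme.
    return "T" + hashlib.sha256(value.encode()).hexdigest()[:12]

def tokenize_record(record: dict, sensitive_fields: list, vault: dict) -> dict:
    """Replace each sensitive field with its token, minting a token on a vault miss."""
    out = dict(record)
    for field in sensitive_fields:
        value = record[field]
        key = f"{field}_key_{value}"  # original_value -> token lookup key
        token = vault.get(key)
        if token is None:
            # Vault miss: mint a token and store both directions.
            token = make_token(value)
            vault[key] = token
            vault[f"{field}_token_{token}"] = value  # de-tokenization entry
        out[field] = token
    return out
```

In the real job the dict lookups would be batched HBase Gets and Puts, but the control flow per record is the same.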
The algorithm is simple and worked fine when processing records of one file type at a time, or when there was little load on the cluster. But with simultaneous processing of many files, the tokenization job would take hours to complete. The usual code optimizations recommended for HBase were already in place:
- Batch Lookups & Batch Inserts
- Region Server - Memory & GC settings
- Region Server - Read and Write Handler Count settings
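Of these, batching matters most at this scale: one round trip per batch instead of one per key. A minimal sketch of chunked lookups, again with the vault simulated by a dict (batch size and names are illustrative):

```python
def chunked(items, size):
    """Yield successive fixed-size batches from a list of lookup keys."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_lookup(keys, vault, batch_size=500):
    """Resolve keys against the vault batch by batch.

    In the real job each batch would be a single multi-Get RPC to HBase
    rather than the per-key dict access simulated here.
    """
    results = {}
    for batch in chunked(keys, batch_size):
        for k in batch:
            results[k] = vault.get(k)  # None marks a vault miss
    return results
```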
However, the behavior of the job remained the same. The observation was that, no matter what we did, the region servers were choked by simultaneous loads. So:
- We throttled the execution of the jobs by creating a separate YARN queue with limited resources for tokenization jobs only
- Revisited the row key design. Special care was taken to ensure the row keys were well distributed. Since the HBase vault was bi-directional, supporting both tokenization and de-tokenization, the row key was designed to support both original_value->token and token->original_value lookups. Initially, the row key design was <column_name>_<keyword (key/token)>_<original_value/token_value>, for example imsi_key_9876545589 or imsi_token_ABC4589000. To achieve a better row key distribution, this structure was revised.
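One common way to revise such a key (an assumption here, not necessarily the exact scheme that was used) is to prefix it with a short hash of the value, so that keys for the same column no longer share a long common prefix and spread across region servers:

```python
import hashlib

def salted_row_key(column: str, kind: str, value: str, prefix_len: int = 4) -> str:
    """Build a row key with a short hash prefix to spread keys across regions.

    The original layout <column>_<kind>_<value> clustered every key for a
    column into one contiguous key range; hashing the value first breaks
    up that clustering.
    """
    salt = hashlib.md5(value.encode()).hexdigest()[:prefix_len]
    return f"{salt}_{column}_{kind}_{value}"
```

The original value stays in the key, so both lookup directions still work: the client recomputes the same salt before issuing a Get.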
This improved the overall performance of the jobs, since the region servers were under less pressure, but the main problem remained: the job was slow. The answer was uncovered only after further analysis of the dataset.
The answer is in the nature of the data
It was observed that the sensitive fields being tokenized were highly repetitive: taking only distinct values would shrink the dataset by 60-70%. For example, in telecom probe data the phone number (MSISDN) repeats for every event a cell phone user generates, so in a dataset of a billion rows the same MSISDN occurs again and again. While doing lookups to tokenize such a dataset, all requests for a repeating value are served by one single region server, slowing down the response for each call and, in turn, the entire job. The answer was hidden in the nature of the data being processed.
The problem was then solved using the following steps:
- Take distinct values of the sensitive/PII fields
- Tokenize (batch lookups and writes) using the distinct dataset and get the tokenized value against each value from HBase vault
- Join the tokenized dataset with the original dataset again
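The three steps above can be sketched as follows, with the HBase vault simulated by a dict, the dataset as a list of records, and a placeholder token generator (in the real job the distinct and join steps would be MapReduce stages and the lookups batched HBase calls):

```python
import hashlib

def make_token(value: str) -> str:
    # Placeholder token generator standing in for the real vault service.
    return "T" + hashlib.sha256(value.encode()).hexdigest()[:12]

def tokenize_dataset(records: list, field: str, vault: dict) -> list:
    """Distinct -> tokenize -> join back, as described above."""
    # Step 1: distinct values of the sensitive field (often 60-70% fewer rows).
    distinct_values = {r[field] for r in records}
    # Step 2: one vault lookup/insert per distinct value instead of per row.
    token_map = {}
    for value in distinct_values:
        token = vault.get(value)
        if token is None:
            token = make_token(value)
            vault[value] = token
        token_map[value] = token
    # Step 3: join the tokens back onto the original dataset.
    return [{**r, field: token_map[r[field]]} for r in records]
```

The win comes from step 2: the vault now sees one request per distinct value, so a hot value like a repeating MSISDN no longer hammers a single region server a million times.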
And done! This simple mechanism greatly increased job performance, and we could meet the SLA of completing the tokenization process in 5-10 minutes for all 60 datasets.
About the Author: Garima Dosi