Using HBase to Create an Enterprise Key Service

March 8, 2016 Garima Dosi

Systems often use keys to provide unique identifiers for transactions or entities within the firm; compact keys also save storage and reduce computing needs. HBase's ability to service millions of requests with good concurrency makes it a strong fit for an enterprise key service. Assuming you know HBase or ETL, you can follow the example in this post as long as you understand the following key points:

  • HBase stores its data in a column-oriented format. HBase tables consist of rows identified by row keys, column families, qualifiers, and timestamps. Data is sorted by the byte order of the row keys, and HBase serves it through regions hosted on region servers.
  • In layman's terms, facts are transactional data that are updated frequently and are linked to slowly changing dimension data. For example, in a retail store, sales transactions or inventory are fact data, whereas the products in a transaction are dimension data consisting of product id, type, color, make, and so on. Similarly, time and store can be other dimensions for the same transaction.

Introduction

HBase is the distributed key-value NoSQL store for Hadoop. Its ability to store key-value pairs makes it a good choice for lookups. A typical ETL process involves processing facts and dimensions, and the general requirement is to look up dimension keys in order to link them to fact data. Moreover, if some dimensions do not yet exist, the ETL process must generate new dimension keys using some sort of sequencing logic. To summarize, an ETL process over facts and dimensions involves the following:

  • Start processing fact data, which arrives frequently.
  • Relate the fact data to dimension identifiers and attributes by looking up the dimensions present in the fact data against the dimension database/store.
  • Generate identifiers for dimensions that do not yet exist in the dimension store and use them for fact processing (this is also termed keying).
  • Update the dimension data with new or modified dimensions as and when required.

How does HBase help us to solve this problem?

  1. The HBase table structure is designed to store "unique dimension keys" as row keys and "dimension attributes" as qualifiers in column families. Each dimension key is also assigned a sequence identifier (stored as a qualifier value) that is unique within that dimension's key space; during lookup, the natural key is replaced by this sequence identifier, which is used for further ETL processing.
  2. A single HBase table stores all the dimensions required for a fact.

HBase Dimension Lookup Table

Row key                                    | Column Family:Qualifier | Qualifier Value
-------------------------------------------|-------------------------|--------------------------------------------------------------------------------
<dimension name> + <unique dimension key>  | colfam:seq_id           | Sequence identifier
<dimension name> + <unique dimension key>  | colfam:dim_attrib       | Dimension attributes
<dimension name>                           | colfam:cntr_val         | Last incremented sequence number for the dimension's space (an HBase counter value)
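
For illustration, here is a minimal Java sketch (using the HBase client API) of how rows in this layout might be written for a hypothetical "product" dimension. The table name dim_lookup, the column family colfam, the "#" separator in the row key, and the sample values are assumptions made for the example, not part of the design above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DimensionRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("dim_lookup"))) { // assumed table name

            byte[] cf = Bytes.toBytes("colfam");                             // assumed column family

            // Counter row (<dimension name>): bump the per-dimension counter to get a new sequence id.
            long seq = table.incrementColumnValue(
                    Bytes.toBytes("product"), cf, Bytes.toBytes("cntr_val"), 1L);

            // Dimension row (<dimension name> + <unique dimension key>): register the key and attributes.
            Put put = new Put(Bytes.toBytes("product#P12345"));              // assumed row-key format
            put.addColumn(cf, Bytes.toBytes("seq_id"), Bytes.toBytes(Long.toString(seq)));
            put.addColumn(cf, Bytes.toBytes("dim_attrib"), Bytes.toBytes("type=shoe,color=red"));
            table.put(put);
        }
    }
}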


In this table design, fact processing would include the following steps:

  • HBase serves as the lookup mechanism to get dimension values for a particular dimension key: htable.get("dimension key") returns the sequence id and other attributes for a specific dimension.
  • HBase can generate sequence identifiers for dimension keys that do not yet exist: htable.incr("dimension") increments the counter value for the given dimension space.
  • The new sequence number is registered against the new dimension with htable.put("dimension key", "dimension attributes & sequence number"), after which lookups on that key succeed.

To summarize, the pseudocode for the fact and dimension ETL would look like this:

if htable.exists("dimension key") then
     key = htable.get("dimension key")
     // Use this key and other dimension attributes as required.
else
     dim_seq = htable.incr("dimension")
     htable.put("dimension key", dim_seq, dim_attributes)
     // dim_seq becomes the sequence id for the new dimension.
end if

This logic seems to be fine unless you run a distributed job like MapReduce or Spark to do the ETL process. In that case, the logic should be modified to look like this:

if htable.exists("dimension key") then
     key = htable.get("dimension key")
     // Use this key and other dimension attributes as required.
else
     dim_seq = htable.incr("dimension")
     // Re-check: another task may have created the same dimension key in the meantime.
     if not htable.exists("dimension key") then
          htable.put("dimension key", dim_seq, dim_attributes)
     else
          key = htable.get("dimension key")
          // Use this key and other dimension attributes as required.
     end if
end if
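
As a concrete illustration of the pseudocode above, here is a minimal Java sketch against the HBase client API. It is not the original implementation; the colfam column family, the row-key format, and the string-encoded sequence ids are assumptions carried over from the earlier example.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DimensionLookup {

    private static final byte[] CF = Bytes.toBytes("colfam");               // assumed column family

    // Returns the sequence id for a dimension key, generating and registering it if it is new.
    public static long lookupOrCreate(Table table, String dimension, String dimensionKey,
                                      byte[] dimAttributes) throws IOException {
        byte[] rowKey = Bytes.toBytes(dimension + "#" + dimensionKey);       // assumed row-key format
        Result result = table.get(new Get(rowKey));
        if (!result.isEmpty()) {
            // Dimension already registered: reuse its sequence id.
            return Long.parseLong(Bytes.toString(result.getValue(CF, Bytes.toBytes("seq_id"))));
        }
        // Generate a new sequence number from the per-dimension counter row.
        long dimSeq = table.incrementColumnValue(
                Bytes.toBytes(dimension), CF, Bytes.toBytes("cntr_val"), 1L);
        // Re-check: another task may have registered the same key concurrently.
        Result recheck = table.get(new Get(rowKey));
        if (!recheck.isEmpty()) {
            return Long.parseLong(Bytes.toString(recheck.getValue(CF, Bytes.toBytes("seq_id"))));
        }
        Put put = new Put(rowKey);
        put.addColumn(CF, Bytes.toBytes("seq_id"), Bytes.toBytes(Long.toString(dimSeq)));
        put.addColumn(CF, Bytes.toBytes("dim_attrib"), dimAttributes);
        table.put(put);
        return dimSeq;
    }
}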

The extra check after generating the new sequence number caters to the edge case where two tasks generate a key for the same dimension simultaneously: only one of them registers the key, and the other reuses the value it finds.

Explicit updates to existing dimensions are handled along similar lines.

Overall, this is a simple and reasonable solution. However, it may not be efficient when the ETL process generates many new keys, because of the contention and locking involved in writing to HBase. To optimize it further:

  • Use the batch request APIs for HBase GET and PUT calls.
  • Avoid the HBase EXISTS call; the existence of a row key can be checked with the same GET call that fetches it.
  • Ideally, split the logic above into two parts: lookups (existence checks) for dimension keys are done in the map phase of a Spark or MapReduce job, while generation of new dimensions using INCR and PUT calls is deferred to a reduce phase. In both phases the GETs and PUTs are issued as batched HBase operations, as in the sketch after this list.
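
A rough sketch of the batched variant, assuming the same hypothetical table layout as the earlier examples: the map phase issues one multi-get per batch of dimension keys, and the reduce phase issues one multi-put per batch of newly generated dimensions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedDimensionOps {

    private static final byte[] CF = Bytes.toBytes("colfam");               // assumed column family

    // Map phase: look up a batch of dimension row keys with a single multi-get.
    // An empty Result marks a key that does not exist yet and must be generated in the reduce phase.
    public static Result[] batchLookup(Table table, List<String> rowKeys) throws IOException {
        List<Get> gets = new ArrayList<>();
        for (String rowKey : rowKeys) {
            gets.add(new Get(Bytes.toBytes(rowKey)));
        }
        return table.get(gets);
    }

    // Reduce phase: register a batch of newly generated dimensions with a single multi-put.
    public static void batchRegister(Table table, List<String> rowKeys,
                                     List<Long> seqIds, List<byte[]> attributes) throws IOException {
        List<Put> puts = new ArrayList<>();
        for (int i = 0; i < rowKeys.size(); i++) {
            Put put = new Put(Bytes.toBytes(rowKeys.get(i)));
            put.addColumn(CF, Bytes.toBytes("seq_id"), Bytes.toBytes(Long.toString(seqIds.get(i))));
            put.addColumn(CF, Bytes.toBytes("dim_attrib"), attributes.get(i));
            puts.add(put);
        }
        table.put(puts);
    }
}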

Apart from these, lookup performance can be improved with the usual HBase cluster tuning measures: pre-splitting regions, designing row keys to avoid hotspotting, salting HBase tables, and so on. HBase cluster tuning is a separate topic in itself.

References

https://hbase.apache.org/book.html

http://blog.cloudera.com/blog/2015/06/how-to-scan-salted-apache-hbase-tables-with-region-specific-key-ranges-in-mapreduce/

http://searchdatamanagement.techtarget.com/answer/What-are-the-differences-between-fact-tables-and-dimension-tables-in-star-schemas


About the Author

Garima Dosi

Solution Architect, Engineering - Guwahati, India
