For enterprises looking for ways to more quickly ingest data into their Hadoop data lakes, Kafka is a great option. What is Kafka? Kafka is a distributed, scalable and reliable messaging system that integrates applications/data streams using a publish-subscribe model. It is a key component in the Hadoop technology stack to support real-time data analytics or monetization of Internet of Things (IOT) data.
This article is for the technical folks. Read on and I’ll diagram how Kafka can stream data from a relational database management system (RDBMS) to Hive, which can enable a real-time analytics use case. For reference, the component versions used in this article are Hive 1.2.1, Flume 1.6 and Kafka 0.9.
If you're looking for an overview of what Kafka is and what it does, check out my earlier blog published on Datafloq.
Where Kafka fits: The overall solution architecture
The following diagram shows the overall solution architecture where transactions committed in RDBMS are passed to the target Hive tables using a combination of Kafka and Flume, as well as the Hive transactions feature.
7 steps to real-time streaming to Hadoop
Now let’s dig into the solution details and I’ll show you how you can start streaming data to Hadoop in just a few steps.
1. Extract data from the relational database management system (RDBMS)
All relational databases have a log file that records the latest transactions. The first step for our streaming solution is to obtain these transactions in a format that can be passed to Hadoop. Walking through the exact mechanisms of this extraction could take up a separate blog post – so please reach out to us if you’d like more information pertaining to that process.
2. Set up the Kafka producer
Processes that publish messages to a Kafka topic are called “producers.” “Topics” are feeds of messages in categories that Kafka maintains. The transactions from RDBMS will be converted to Kafka topics. For this example, let’s consider a database for a sales team from which transactions are published as Kafka topics. The following steps are required to set up the Kafka producer:
Next we will create a Hive table that is ready to receive the sales team’s database transactions. For this example, we will create a customer table:
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
Now let’s look at how to create a Flume agent that will source from Kafka topics and send data to the Hive table.
Follow these steps to set up the environment before starting the Flume agent:
Next create a log4j properties file as follows:
[bedrock@sandbox conf]$ vi log4j.properties
Then use the following configuration file for the Flume agent:
Use the following command to start the flume agent:
$ /usr/hdp/apache-flume-1.6.0/bin/flume-ng agent -n flumeagent1 -f ~/streamingdemo/flume/conf/flumetohive.conf
6. Start the Kafka stream
As an example, below is a simulation of the transactions messages, which in an actual system will need to be generated by the source database. For instance, the following could come from Oracle streams that replay the SQL transactions that were committed to the database, or they could come from GoldenGate.
2,"Cody Bond","email@example.com","232-513 Molestie Road","Aenean Eget Magna Incorporated"
7. Receive Hive data
With all of the above accomplished, now when you send data from Kafka, you will see the stream of data being sent to the Hive table within seconds.
Opening the door to new use cases
I hope now you have a better idea of how real-time data from a relational source can be sent to Hive and be consumed by big data applications using Kafka. Compared to traditional message brokers, Kafka is easier to scale to accommodate massive streams of data, including IoT data, to enable near real-time analytics. This ability gives enterprises a competitive advantage by enabling a wider range of use cases leveraging the Hadoop ecosystem.
About the AuthorMore Content by Rajesh Nadipalli