By Ankhimita Paul Choudhury & Sunayan Saikia
As you might already know, Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases (MySQL, Oracle, Netezza, etc.).
As we hacked into Sqoop, one interesting thing we found is the plugin framework it supports, which lets us create our own custom tool that functions like any of Sqoop's inbuilt tools (commands), such as import, export, create-hive-table, and list-tables. Creating a custom tool lets us implement our own logic in Sqoop as needed.
You can use the following steps as a guide to develop your own Sqoop plugin from scratch.
Defining the Sqoop plugin
If we want a custom tool in Sqoop (apart from the tools Sqoop already provides, such as the import, export, and create-hive-table tools) to meet our own specific requirements, we must create a Sqoop plugin that contains this tool with all of the features our use case requires.
That basically involves designing our classes according to the conventions of Sqoop's plugin architecture. The plugin that we create must have a base class (the plugin class), which wraps the fully developed custom tool.
Considering when to use a custom Sqoop plugin
The current stable version of Sqoop, 1.4.6, is missing features such as the following:
- A simple 'Test connection' feature to a source JDBC URL
- Creating external hive tables and loading data into them
- Supporting impersonation
We can implement any of these using our own custom Sqoop plugin. We use Sqoop 1.4.6 for our example because, at the time of writing, most Hadoop distros, such as CDH, HDP, and MapR, officially support only Sqoop 1.4.6.
Implementing a Sqoop plugin
The creation and implementation of a Sqoop plugin is illustrated in the following 6 steps:
1. Create a project and prepare it for developing your custom Sqoop plugin
- Create a project (We suggest a Maven project).
- Get all the dependencies of Apache Sqoop (install sqoop-1.4.6.jar in your local Maven repository, or get it from the central Maven repository, if available).
Tip: If you are using Maven to build the project and want to install the Sqoop dependency locally, download the Sqoop jar and install it into your local repository with Maven's install:install-file goal.
Then declare Sqoop 1.4.6 as a dependency in the Maven pom.xml (it is resolved from either the local Maven repository or the central Maven repository).
2. Create the custom tool class
Create a class with a name that ends with the word ‘Tool’ just to provide the context that it is the custom tool class (for example, AbcTool class). This custom tool class will be responsible for performing the functions that you require.
What does the custom tool class need to inherit, and which methods must be overridden?
BaseSqoopTool is the base class for all Sqoop Tools. So, if you intend to develop a custom tool, you need to make sure your custom tool class extends from the org.apache.sqoop.tool.BaseSqoopTool class and overrides the run(SqoopOptions options) method:
public int run(SqoopOptions options): This method acts as an entry point for execution for your custom tool.
NOTE: The following two points related to the working of the custom tool need careful consideration before proceeding:
- You are free to override any method of the BaseSqoopTool class. If you want to build on top of one of the inbuilt tools, it is better to extend that specific tool class and override only the methods you need. For example, your custom class can extend the import tool class if you are enhancing the existing import logic.
- Support for user-defined custom options: Examples of existing command-line arguments are --connect <jdbcUrl>, --username <username>, and --table <table name>. If you want custom arguments for your tool, the following 3 methods need to be overridden:
  - public void configureOptions(ToolOptions toolOptions): This method configures the command-line arguments you expect to receive for your custom tool. You can also specify a description for each argument. When a user executes sqoop help <custom tool>, the information provided in this method is output to the user.
  - public void applyOptions(CommandLine in, SqoopOptions out): This method parses all the options and populates SqoopOptions, which acts as a data transfer object during the complete execution.
  - public void validateOptions(SqoopOptions options): This method performs any validation required for your options.
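The shape of such a custom tool class can be sketched as follows. This is only a minimal illustration under stated assumptions: the small classes at the top are hypothetical stand-ins so that the sketch compiles without Sqoop on the classpath, and in a real plugin you would import the corresponding Sqoop and Commons CLI classes instead, whose APIs are richer. The tool name abc and the option abc-target are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// ---- Hypothetical stand-ins (NOT the real Sqoop / Commons CLI classes) ----
class SqoopOptions {
    private final Map<String, String> values = new HashMap<>();
    void set(String key, String value) { values.put(key, value); }
    String get(String key) { return values.get(key); }
}

class CommandLine {
    private final Map<String, String> parsed = new HashMap<>();
    CommandLine with(String option, String value) { parsed.put(option, value); return this; }
    boolean hasOption(String option) { return parsed.containsKey(option); }
    String getOptionValue(String option) { return parsed.get(option); }
}

class ToolOptions {
    private final Map<String, String> descriptions = new HashMap<>();
    void describe(String option, String description) { descriptions.put(option, description); }
    String descriptionOf(String option) { return descriptions.get(option); }
}

class InvalidOptionsException extends Exception {
    InvalidOptionsException(String message) { super(message); }
}

abstract class BaseSqoopTool {
    private final String toolName;
    protected BaseSqoopTool(String toolName) { this.toolName = toolName; }
    public String getToolName() { return toolName; }
    public abstract int run(SqoopOptions options);
    public void configureOptions(ToolOptions toolOptions) { }
    public void applyOptions(CommandLine in, SqoopOptions out) throws InvalidOptionsException { }
    public void validateOptions(SqoopOptions options) throws InvalidOptionsException { }
}

// ---- The custom tool itself ----
class AbcTool extends BaseSqoopTool {
    static final String TARGET_ARG = "abc-target"; // hypothetical custom option

    AbcTool() { super("abc"); }

    @Override
    public void configureOptions(ToolOptions toolOptions) {
        // Declare the custom argument and the help text shown to the user.
        toolOptions.describe(TARGET_ARG, "Target name used by the abc tool");
    }

    @Override
    public void applyOptions(CommandLine in, SqoopOptions out) {
        // Copy parsed command-line values into the options object.
        if (in.hasOption(TARGET_ARG)) {
            out.set(TARGET_ARG, in.getOptionValue(TARGET_ARG));
        }
    }

    @Override
    public void validateOptions(SqoopOptions options) throws InvalidOptionsException {
        if (options.get(TARGET_ARG) == null) {
            throw new InvalidOptionsException("--" + TARGET_ARG + " is required");
        }
    }

    @Override
    public int run(SqoopOptions options) {
        // Entry point: the tool's real work would go here.
        System.out.println("Running abc against " + options.get(TARGET_ARG));
        return 0; // zero signals success
    }
}

class AbcToolDemo {
    // Mimics the order in which the methods are exercised: apply, validate, run.
    static int runWith(CommandLine commandLine) {
        AbcTool tool = new AbcTool();
        SqoopOptions options = new SqoopOptions();
        try {
            tool.applyOptions(commandLine, options);
            tool.validateOptions(options);
        } catch (InvalidOptionsException e) {
            System.err.println(e.getMessage());
            return 1;
        }
        return tool.run(options);
    }

    public static void main(String[] args) {
        System.exit(runWith(new CommandLine().with(AbcTool.TARGET_ARG, "warehouse")));
    }
}
```

Keeping parsing (applyOptions) separate from validation (validateOptions) means run() can assume it receives a complete, checked SqoopOptions object.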
3. Create any number of required tertiary classes
4. Create user-defined tool plugin class [Wrapper class]
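A wrapper class of this kind might look like the sketch below. It assumes that Sqoop's plugin base class is ToolPlugin, with a getTools() method returning a list of ToolDesc entries (tool name, tool class, description); please verify this against the Sqoop version you build against. The classes at the top are hypothetical stand-ins so that the sketch compiles on its own, and AbcPlugin and AbcTool are example names.

```java
import java.util.Collections;
import java.util.List;

// ---- Hypothetical stand-ins (NOT the real Sqoop classes) ----
abstract class SqoopTool { }

class ToolDesc {
    private final String name;
    private final Class<? extends SqoopTool> toolClass;
    private final String description;

    ToolDesc(String name, Class<? extends SqoopTool> toolClass, String description) {
        this.name = name;
        this.toolClass = toolClass;
        this.description = description;
    }

    String getName() { return name; }
    Class<? extends SqoopTool> getToolClass() { return toolClass; }
    String getDesc() { return description; }
}

abstract class ToolPlugin {
    public abstract List<ToolDesc> getTools();
}

// Placeholder for the custom tool class created in step 2.
class AbcTool extends SqoopTool { }

// ---- The plugin (wrapper) class ----
class AbcPlugin extends ToolPlugin {
    @Override
    public List<ToolDesc> getTools() {
        // One entry per tool: the command name, its class, and the help text.
        return Collections.singletonList(
            new ToolDesc("abc", AbcTool.class, "Runs the example abc tool"));
    }
}

class AbcPluginDemo {
    public static void main(String[] args) {
        for (ToolDesc tool : new AbcPlugin().getTools()) {
            System.out.println(tool.getName() + " -> " + tool.getToolClass().getSimpleName());
        }
    }
}
```

A single plugin class can return several tool descriptors, so one jar can contribute more than one command.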
5. Build the plugin
Resolve any third-party dependencies that you might be using (for example, org.apache.commons, etc.) by adding them to your pom.xml file. Build the plugin through Maven or an alternative, and make sure the required jar is generated (for example, abc-sqoop-plugin-1.0.0-SNAPSHOT.jar).
6. Register the user-defined plugin with Sqoop
- Export the plugin jar: Copy your plugin jar to the $SQOOP_LIB directory.
- Register the plugin class with Sqoop: Define a value for the property sqoop.tool.plugins in the sqoop-site.xml file, which is present inside the configuration directory of Sqoop. The value should be the name of the plugin class prefixed with its package name. The property's description element simply describes what sqoop.tool.plugins does.
IMPORTANT - Verifying that the plugin got registered: After you are done with the above steps, type 'sqoop help' on the command line; you should see your plugin listed along with the inbuilt tools (commands).
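As a quick sanity check before running 'sqoop help', you could parse sqoop-site.xml yourself and confirm the property is set. The sketch below embeds a sample property definition as a string so it runs on its own; the class name com.example.sqoop.AbcPlugin and the description text are assumptions made up for illustration, and SiteConfigCheck is a hypothetical helper, not part of Sqoop.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SiteConfigCheck {
    // Sample sqoop-site.xml content; the plugin class name is hypothetical.
    static final String SAMPLE_SITE_XML =
          "<configuration>\n"
        + "  <property>\n"
        + "    <name>sqoop.tool.plugins</name>\n"
        + "    <value>com.example.sqoop.AbcPlugin</value>\n"
        + "    <description>Plugin classes that register custom tools.</description>\n"
        + "  </property>\n"
        + "</configuration>\n";

    // Returns the trimmed text of the first child element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    // Returns the value of the named property, or null if it is not defined.
    static String propertyValue(String xml, String propertyName) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList properties = document.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (propertyName.equals(childText(property, "name"))) {
                    return childText(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(propertyValue(SAMPLE_SITE_XML, "sqoop.tool.plugins"));
    }
}
```

To check a real installation, you would read the file from Sqoop's configuration directory instead of using the embedded sample.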
Visualizing the Sqoop plugin
Hacking into classes in a Sqoop Plugin (Optional)
To bring in your own custom connection managers:
Sqoop has several built-in connection manager classes such as NetezzaManager, PostgresqlManager and the like. All these managers ultimately extend a base connection manager class called org.apache.sqoop.manager.ConnManager. Hence, the first thing we need to do to create a custom connection manager class is to extend this base class. Alternatively, if you wish to build on any of the existing connection manager classes such as org.apache.sqoop.manager.NetezzaManager, org.apache.sqoop.manager.PostgresqlManager, etc., you can extend that specific class.
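Extending an existing manager can be sketched as below. The ConnManager and PostgresqlManager classes here are hypothetical, heavily reduced stand-ins so the example compiles without Sqoop; the real classes have many more methods, and the override chosen (escapeColName) is only one assumed example of behavior you might customize. AbcManager is an example name.

```java
// ---- Hypothetical stand-ins (NOT the real Sqoop classes) ----
class SqoopOptions { }

abstract class ConnManager {
    protected final SqoopOptions options;
    protected ConnManager(SqoopOptions options) { this.options = options; }

    // Default behavior: column names are used as-is.
    public String escapeColName(String colName) { return colName; }
}

class PostgresqlManager extends ConnManager {
    PostgresqlManager(SqoopOptions options) { super(options); }

    @Override
    public String escapeColName(String colName) {
        // Quote the identifier.
        return "\"" + colName + "\"";
    }
}

// ---- The custom connection manager ----
class AbcManager extends PostgresqlManager {
    AbcManager(SqoopOptions options) { super(options); }

    @Override
    public String escapeColName(String colName) {
        // Custom tweak: strip stray whitespace, then reuse the parent's quoting.
        return super.escapeColName(colName.trim());
    }
}

class AbcManagerDemo {
    public static void main(String[] args) {
        System.out.println(new AbcManager(new SqoopOptions()).escapeColName(" id "));
    }
}
```

Extending a specific manager rather than ConnManager directly lets you inherit everything that already works for that database and override only what differs.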
Next, you need to design your own manager factory class to instantiate the custom connection manager created above. Existing connection managers are currently created by instances of the abstract class ManagerFactory. One ManagerFactory implementation serves all of Sqoop: org.apache.sqoop.manager.DefaultManagerFactory. ManagerFactory has a single method of note, named accept(). This method determines whether it can instantiate a ConnManager for the user's SqoopOptions. If so, it returns the ConnManager instance. Otherwise, it returns null. So, to instantiate your own connection manager, you need to create a manager factory class that extends the DefaultManagerFactory class and overrides the accept() method. Inside the accept() method, you create and return an instance of your custom connection manager.
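A custom factory along these lines is sketched below. Everything above AbcManagerFactory is a hypothetical stand-in so the example is self-contained. The real accept() signature may differ between Sqoop versions (it may take a job-data wrapper rather than SqoopOptions directly), so check the ManagerFactory class in your version. The jdbc:abc: URL prefix is invented for the example.

```java
// ---- Hypothetical stand-ins (NOT the real Sqoop classes) ----
class SqoopOptions {
    private final String connectString;
    SqoopOptions(String connectString) { this.connectString = connectString; }
    String getConnectString() { return connectString; }
}

class ConnManager {
    protected final SqoopOptions options;
    ConnManager(SqoopOptions options) { this.options = options; }
}

class DefaultManagerFactory {
    // Returns a manager for URLs it recognizes, or null otherwise.
    public ConnManager accept(SqoopOptions options) {
        String url = options.getConnectString();
        if (url != null && url.startsWith("jdbc:postgresql:")) {
            return new ConnManager(options);
        }
        return null;
    }
}

// Placeholder for the custom connection manager.
class AbcManager extends ConnManager {
    AbcManager(SqoopOptions options) { super(options); }
}

// ---- The custom manager factory ----
class AbcManagerFactory extends DefaultManagerFactory {
    @Override
    public ConnManager accept(SqoopOptions options) {
        String url = options.getConnectString();
        if (url != null && url.startsWith("jdbc:abc:")) {
            // Our database: hand back the custom connection manager.
            return new AbcManager(options);
        }
        // Anything else: fall back to the default factory's behavior.
        return super.accept(options);
    }
}

class AbcManagerFactoryDemo {
    public static void main(String[] args) {
        ConnManager manager = new AbcManagerFactory().accept(new SqoopOptions("jdbc:abc://host/db"));
        System.out.println(manager.getClass().getSimpleName());
    }
}
```

Delegating to super.accept() matters when the custom factory replaces DefaultManagerFactory in the configuration, because the built-in managers must still be reachable through it.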
The final step is to register your custom manager factory class [AbcManagerFactory] in the sqoop-site.xml file, present inside the conf directory of Sqoop. Add the property sqoop.connection.factories to sqoop-site.xml, if not already present. The value of this property names the manager factories that Sqoop registers for instantiating connection managers. Hence, put the name of your own custom manager factory in the value in place of DefaultManagerFactory, and keep any other factories as they are.
Confirming if the plugin is successfully registered
If your plugin is successfully registered after following the above steps, typing 'sqoop help' in the terminal will list your plugin among the available tools, as highlighted in the image below. Here, we are assuming you have named your plugin 'db-import'.