How to Build Your Own Sqoop Plugin

July 24, 2018 Sunayan Saikia

By Ankhimita Paul Choudhury & Sunayan Saikia

As you might already know, Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases (MySQL, Oracle, Netezza, etc.).

While digging into Sqoop, we found an interesting feature: its plugin framework, which lets us create our own custom tool that functions like any of Sqoop's built-in tools (commands), such as import, export, create-hive-table, and list-tables. Creating a custom tool lets us implement our own logic in Sqoop as needed.

You can use the following steps as a guide to develop your own Sqoop plugin from scratch.

Defining the Sqoop plugin

If we want a custom tool in Sqoop (apart from the tools Sqoop already provides, such as the import, export, and create-hive-table tools) to meet our own specific requirements, we must create a Sqoop plugin that provides this tool with all the features our use case requires.

This basically involves designing our classes according to the conventions of the plugin architecture that Sqoop requires us to follow. The plugin must have a base class (the plugin class), which wraps the fully developed custom tool.

Considering when to use a custom Sqoop plugin

The current stable version of Sqoop, 1.4.6, is missing features such as the following:

  • A simple 'Test connection' feature to a source JDBC URL
  • Creating external hive tables and loading data into them
  • Supporting impersonation

We can implement any of these using our own custom Sqoop plugin. We are using this Sqoop version for our example because, at the time of writing, most Hadoop distributions, such as CDH, HDP, and MapR, officially support only Sqoop 1.4.6.

Implementing a Sqoop plugin

The creation and implementation of a Sqoop plugin is illustrated in the following 6 steps:

1. Create a project and prepare it for developing your custom Sqoop plugin

  • Create a project (we suggest a Maven project).
  • Get all the dependencies of Apache Sqoop: either install sqoop-1.4.6.jar in your local Maven repository or resolve it from the central Maven repository, if available.

Tip: If you are using Maven to build the project and want to install the Sqoop dependency locally, download the Sqoop jar and use the following command:

mvn install:install-file -Dfile=<location-of-the-sqoop-jar> -DgroupId=org.apache.sqoop -DartifactId=sqoop -Dversion=1.4.6 -Dpackaging=jar -DgeneratePom=true


Then we need to put the following dependency into the Maven pom.xml (resolved from either local Maven repository or central Maven repository):

<dependency>
   <groupId>org.apache.sqoop</groupId>
   <artifactId>sqoop</artifactId>
   <version>1.4.6</version>
</dependency>


2. Create the custom tool class

Create a class with a name that ends with the word ‘Tool’ just to provide the context that it is the custom tool class (for example, AbcTool class). This custom tool class will be responsible for performing the functions that you require.

What does the custom tool class need to inherit, and which methods must be overridden?

BaseSqoopTool is the base class for all Sqoop tools. So, if you intend to develop a custom tool, make sure your custom tool class extends the org.apache.sqoop.tool.BaseSqoopTool class and overrides the run(SqoopOptions options) method:

public int run(SqoopOptions options): This method is the entry point for your custom tool's execution.

NOTE: Consider the following two points about how the custom tool works before proceeding:

  • You are free to override any method of the BaseSqoopTool class. If you want to build on one of the existing built-in tools, it is better to extend that specific tool class and override only the methods you need. For example, your custom class can extend the import tool class if you are building on the existing logic of the import tool.
  • Support for user-defined custom options: If you want custom arguments for your tool, override the following 3 methods. (Examples of existing command-line arguments are --connect <jdbcUrl>, --username <username>, and --table <table name>.)
    • public void configureOptions(ToolOptions toolOptions): This method configures the command-line arguments you expect to receive for your custom tool. You can also specify a description for each argument. When a user executes sqoop help <custom tool>, the information provided in this method is output to the user.
    • public void applyOptions(CommandLine in, SqoopOptions out): This method parses all the options and populates SqoopOptions, which acts as a data transfer object throughout the execution.
    • public void validateOptions(SqoopOptions options): This method performs any validation required for your options.
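The division of labor among these methods can be sketched without any Sqoop dependency. The following is only an illustration built from stand-in types: ToolLifecycleSketch, its nested Options class, and the target-schema option are all hypothetical, and a real tool would use ToolOptions, CommandLine, and SqoopOptions instead and would signal invalid options in the way Sqoop expects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in sketch of the option-handling flow; none of these types are Sqoop classes.
class ToolLifecycleSketch {

    /** Stand-in for SqoopOptions: carries parsed values through the run. */
    static class Options {
        String targetSchema;
    }

    /** Role of configureOptions(): declare each argument and its help text. */
    static Map<String, String> configureOptions() {
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("target-schema", "Schema to load into (hypothetical option)");
        return declared;
    }

    /** Role of applyOptions(): copy parsed command-line values into the options object. */
    static void applyOptions(Map<String, String> commandLine, Options out) {
        out.targetSchema = commandLine.get("target-schema");
    }

    /** Role of validateOptions(): reject missing values before the tool runs. */
    static void validateOptions(Options options) {
        if (options.targetSchema == null || options.targetSchema.isEmpty()) {
            throw new IllegalArgumentException("--target-schema is required");
        }
    }

    /** Role of run(): do the work and return an exit status (0 for success here). */
    static int run(Options options) {
        System.out.println("Loading into schema " + options.targetSchema);
        return 0;
    }

    public static void main(String[] args) {
        Map<String, String> commandLine = new LinkedHashMap<>();
        commandLine.put("target-schema", "sales");
        Options options = new Options();
        applyOptions(commandLine, options);
        validateOptions(options);
        System.exit(run(options));
    }
}
```

Keeping parsing, validation, and execution separate in this way means run() only ever sees an options object that has already been checked.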

Tip: Please note that support for user-defined custom options does not quite work in Sqoop v1.4.6. (A related bug has been resolved in Sqoop v1.4.7, but that version is not available to us until the CDH, HDP, and MapR Hadoop distributions ship it.) Instead, you can use Sqoop's generic arguments to pass in your custom values. The values of the generic arguments are set as configuration in Sqoop's data transfer object and so are available throughout all the classes in Sqoop. You can set the arguments like this: -D <key>=<value>. You do not need these if you have no custom values to pass to Sqoop.
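To make the generic-argument route concrete, here is a minimal, self-contained sketch of how -D <key>=<value> pairs can be separated from the remaining arguments and stored as key/value configuration. It uses a plain Map as a stand-in for the configuration object. The class name GenericArgsSketch and the key abc.target.schema are hypothetical, and this is not how Sqoop or Hadoop implement the parsing internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch: illustrates the idea only, not Sqoop or Hadoop code.
class GenericArgsSketch {

    /**
     * Collects "-D key=value" (or "-Dkey=value") pairs into a map.
     * Arguments that are not -D pairs are appended to 'rest' unchanged;
     * -D pairs without '=' are ignored.
     */
    static Map<String, String> parse(String[] args, List<String> rest) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);     // form: -Dkey=value
            }
            if (pair == null) {
                rest.add(arg);
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                conf.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return conf;
    }

    public static void main(String[] argv) {
        List<String> rest = new ArrayList<>();
        Map<String, String> conf = parse(new String[] {
            "-D", "abc.target.schema=sales", "--connect", "jdbc:netezza://host/db"
        }, rest);
        System.out.println(conf.get("abc.target.schema"));
        System.out.println(rest);
    }
}
```

Inside a real tool, the equivalent lookup would read the key from the configuration carried by SqoopOptions (for example, through options.getConf()). Generic arguments are expected to appear right after the tool name, before any tool-specific arguments.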

3. Create any required helper classes

You can create as many helper classes as your custom tool class needs to depend on.

4. Create user-defined tool plugin class [Wrapper class]

Create a class with a name ideally ending with the word “Plugin” to indicate that it is the user-defined plugin class (for example, the AbcPlugin class). This is the wrapper class that wraps the tool class and its dependency classes as the plugin.

What does the user-defined tool plugin class need to inherit, and which methods must be overridden?

ToolPlugin is the base class for plugins. So, your custom tool plugin class should extend org.apache.sqoop.tool.ToolPlugin and override the getTools() method. As mentioned earlier, the user-defined tool plugin class is needed to wrap the custom tool. After overriding the getTools() method, the plugin implementation should look something like this:

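import java.util.Collections;
import java.util.List;
// ToolDesc's package below is assumed from Sqoop 1.4.6; adjust it for your version.
import com.cloudera.sqoop.tool.ToolDesc;

public class AbcPlugin extends org.apache.sqoop.tool.ToolPlugin {
    @Override
    public List<ToolDesc> getTools() {
        // Register one tool: command name, tool class, and help description.
        return Collections.singletonList(new ToolDesc(
                "put-name-of-command-here",
                AbcTool.class,
                "Put description of the command here"));
    }
}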

5. Build the plugin

  • Resolve any third-party dependencies that you might be using (for example, org.apache.commons, etc.) by adding them to your pom.xml file.
  • Build the plugin through Maven or an alternative, and make sure the required jar is generated (for example, abc-sqoop-plugin-1.0.0-SNAPSHOT.jar).

6. Register the user-defined plugin with Sqoop

This is the final step, which involves registering the plugin class with Sqoop.

The following steps will help you get this right:

  • Export the plugin jar: Copy your plugin jar to the $SQOOP_LIB directory.
  • Register the plugin class with Sqoop: Define the value of the property sqoop.tool.plugins in the sqoop-site.xml file, which is located in Sqoop's configuration directory. The value should be the name of the plugin class prefixed with its package name. The definition should look something like this:

    <property>
       <name>sqoop.tool.plugins</name>
       <value>org.apache.sqoop.tool.AbcPlugin</value>
       <description>
         A comma-delimited list of ToolPlugin implementations which are consulted, in order, to register SqoopTool instances which allow third-party tools to be used.
       </description>
    </property>

  • The description simply states what the property 'sqoop.tool.plugins' does and is self-explanatory in the example above.
  • IMPORTANT - Verifying that the plugin is registered: After you complete the steps above, typing 'sqoop help' on the command line should list your plugin's tools along with the built-in tools (commands).
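The reason the property value must be a fully qualified class name, and the jar must be on Sqoop's classpath, becomes clearer with a small sketch of name-based plugin loading. The code below is only an illustration under stated assumptions: the Plugin interface is a simplified stand-in for ToolPlugin, the nested AbcPlugin class and its tool description are hypothetical, and Sqoop's actual loading code is more involved.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in sketch of plugin discovery; these types only mimic the shape of Sqoop's.
class PluginLoadingSketch {

    /** Stand-in for ToolPlugin: each plugin reports tool name -> description. */
    interface Plugin {
        Map<String, String> getTools();
    }

    /** Hypothetical plugin contributing a single tool. */
    public static class AbcPlugin implements Plugin {
        @Override
        public Map<String, String> getTools() {
            return Collections.singletonMap("db-import", "Hypothetical custom import tool");
        }
    }

    /** Instantiates each listed class by name, in order, and collects its tools. */
    static Map<String, String> loadTools(String pluginsProperty) {
        Map<String, String> registry = new LinkedHashMap<>();
        for (String rawName : pluginsProperty.split(",")) {
            String className = rawName.trim();
            if (className.isEmpty()) {
                continue;
            }
            try {
                // Fails if the class is not on the classpath or has no no-arg constructor.
                Plugin plugin = (Plugin) Class.forName(className)
                        .getDeclaredConstructor().newInstance();
                registry.putAll(plugin.getTools());
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot load plugin " + className, e);
            }
        }
        return registry;
    }

    public static void main(String[] args) {
        System.out.println(loadTools(AbcPlugin.class.getName()));
    }
}
```

If the class name is misspelled or the jar is missing, a lookup of this kind fails at load time, which is why checking the output of 'sqoop help' is a useful verification step.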

Visualizing the Sqoop plugin

The following diagram shows the relationships among the classes required to implement a Sqoop plugin.

 

 


Hacking into classes in a Sqoop Plugin (Optional)

The steps we have followed so far help us create a custom plugin. Another powerful, optional thing we can do is provide a Manager Factory, which lets us use a custom Connection Manager for an RDBMS. This way, the Sqoop plugin you create can also enhance Sqoop to do other things, such as ‘listing schema’, if required (implementing a listing schema feature is beyond the scope of this article).

To bring in your own custom connection managers:

  1. Sqoop has several built-in Connection Manager classes, such as NetezzaManager, PostgresqlManager, and the like. All these managers ultimately extend a base connection manager class called org.apache.sqoop.manager.ConnManager. Hence, the first thing we need to do to create a custom Connection Manager class is to extend this base class. Alternatively, if you wish to build on one of the existing Connection Manager classes, such as org.apache.sqoop.manager.NetezzaManager or org.apache.sqoop.manager.PostgresqlManager, you can extend that specific class.

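    Example:
    public class AbcNetezzaManager extends
    org.apache.sqoop.manager.NetezzaManager {
        // The parent class needs the options, so pass them up.
        public AbcNetezzaManager(SqoopOptions options) {
            super(options);
        }
        //TODO: Introduce & implement methods as per requirements
    }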
  2. Next, you need to design your own Manager Factory class to instantiate the custom connection manager we created above. Existing connection managers are created by instances of the abstract class ManagerFactory. One ManagerFactory implementation serves all of Sqoop: org.apache.sqoop.manager.DefaultManagerFactory. ManagerFactory has a single method of note, named accept(). This method determines whether the factory can instantiate a ConnManager for the user’s SqoopOptions. If so, it returns the ConnManager instance. Otherwise, it returns null. So, to instantiate your own connection manager, create a manager factory class that extends DefaultManagerFactory and override the accept() method. Inside the accept() method, create an instance of your custom connection manager, as illustrated below:

     
    // Sqoop imports are omitted here because their packages vary by Sqoop version.
    public class AbcManagerFactory extends DefaultManagerFactory {
        @Override
        public ConnManager accept(JobData data) {
            SqoopOptions sqoopOptions = data.getSqoopOptions();
            String scheme = extractScheme(sqoopOptions);
            if (scheme != null && scheme.startsWith("jdbc:netezza:")) {
                if (sqoopOptions.isDirect()) {
                    return new DirectNetezzaManager(sqoopOptions);
                } else {
                    // our custom manager
                    return new AbcNetezzaManager(sqoopOptions);
                }
            }
            // Not a Netezza URL: fall back to the default behavior.
            return super.accept(data);
        }
    }

  3. The final step is to register your custom Manager Factory class [AbcManagerFactory] in the sqoop-site.xml file, located in Sqoop's conf directory. Add the property sqoop.connection.factories to sqoop-site.xml if it is not already present. Its value lists the Manager Factories that Sqoop registers for instantiating Connection Managers. Hence, add the fully qualified name of your custom Manager Factory to the value in place of DefaultManagerFactory, and keep the other factories as they are. This is illustrated below:

     
    <property>
      <name>sqoop.connection.factories</name>
      <value>org.apache.sqoop.manager.oracle.OraOopManagerFactory,abc.custom.manager.factory.AbcManagerFactory</value>
      <description>
        AbcManagerFactory is a custom manager factory that extends the DefaultManagerFactory provided by Sqoop with additional functionality
      </description>
    </property>
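Because accept() returns null when a factory cannot handle a connection, the registered factories effectively form a chain in which the first non-null answer wins. Here is a minimal sketch of that idea, assuming simplified stand-in types: the Factory interface, the two factory constants, and the manager names returned as strings are illustrative and are not Sqoop's classes.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in sketch of factory chaining; these types only mimic the shape of Sqoop's.
class FactoryChainSketch {

    /** Stand-in for ManagerFactory: returns a manager name, or null to decline. */
    interface Factory {
        String accept(String connectString);
    }

    /** Hypothetical custom factory: claims only Netezza URLs. */
    static final Factory ABC_FACTORY = url ->
            url.startsWith("jdbc:netezza:") ? "AbcNetezzaManager" : null;

    /** Hypothetical fallback standing in for the remaining registered factories. */
    static final Factory FALLBACK_FACTORY = url ->
            url.startsWith("jdbc:") ? "GenericJdbcManager" : null;

    /** Asks each factory in order; the first non-null answer wins. */
    static String resolve(List<Factory> factories, String connectString) {
        for (Factory factory : factories) {
            String manager = factory.accept(connectString);
            if (manager != null) {
                return manager;
            }
        }
        throw new IllegalArgumentException("No manager for " + connectString);
    }

    public static void main(String[] args) {
        List<Factory> factories = Arrays.asList(ABC_FACTORY, FALLBACK_FACTORY);
        System.out.println(resolve(factories, "jdbc:netezza://host/db"));
        System.out.println(resolve(factories, "jdbc:mysql://host/db"));
    }
}
```

This is why a custom factory should either decline cleanly with null or delegate to its parent for connection strings it does not handle: otherwise it would block the factories listed after it.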

And you're done fabricating your own custom Connection Manager. Cool, isn't it?

Confirming if the plugin is successfully registered

If your plugin is successfully registered after following the above steps, typing ‘sqoop help’ in the terminal will list your plugin's tool among the available commands, something like what is highlighted in the image below. Here, we are assuming you have named your tool ‘db-import’.

 

About the Author

Sunayan Saikia

Module Lead
