3 Hacks to Get the Most From Sqoop

June 22, 2017 Hiranya Deva Sarmah

When it comes to transferring large amounts of bulk data between relational databases and Hadoop, Sqoop is a very effective tool. It does, however, have a few issues. Here are some hacks we’ve found useful for working around them.

Challenge #1: Importing tables from Oracle to Hive

We had to import a number of tables from an Oracle database into Hive. To do that, we used the --hive-drop-import-delims or --hive-delims-replacement options in our Sqoop commands to handle delimiters embedded in the data.

They worked perfectly until we encountered the data types CLOB, ROWID, BLOB and BFILE. The CLOB columns contained line feed, carriage return, and Ctrl-A characters as part of the data.

Because --hive-drop-import-delims and --hive-delims-replacement did not work against the Oracle database, the Sqoop process treated the line feeds and carriage returns as new-record markers, which caused validation failures on record count and schema integrity. The delimiter-handling options have been seen to work for other database sources, but for Oracle they did not.

We saw similar issues with the ROWID data type as well; again, a traditional import failed.
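For context, the imports we were running looked roughly like the sketch below. The table name, Hive table name, and options file are illustrative placeholders rather than our exact command; the point is that --hive-drop-import-delims (or --hive-delims-replacement) was in place and still did not clean the CLOB data.

sqoop import \
--options-file $OPTIONS_FILE \
--table EMPLOYEE \
--hive-import \
--hive-table employee \
--hive-drop-import-delims \
--num-mappers $NUM_MAPPERS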

The workaround 

We use either a table-based or a query-based import in the Sqoop import tool to load data into HDFS. In this scenario, we used a query-based import to bring the data in.

To bring in the CLOB and ROWID columns, we applied some additional character operations to the corresponding column data.

In the query supplied to the Sqoop tool, all the columns of the table were selected, but instead of select *, an explicit select col1, col2, ... was used. This let us target the specific CLOB and ROWID columns and apply the necessary character-replacement operations.

e.g.
Let EMPLOYEE(Name String, Position String, Current_Project CLOB, RowIdentifier ROWID) be the table in question.

Now, the command we executed for sqooping the EMPLOYEE table from Oracle to Hive is shown below:

sqoop import $HADOOP_OPTS \
--options-file $OPTIONS_FILE \
--query "${SQL} where \$CONDITIONS" \
--mapreduce-job-name "${TBL}_LOAD" \
--target-dir $TARGET_DIR/$TABLE_UPPER \
--fields-terminated-by '\001' \
--null-string "" \
--null-non-string "" \
--escaped-by "\\" \
--split-by $SPLIT_BY_COLUMN \
--num-mappers $NUM_MAPPERS \
--outdir $SCRIPT_PATH/javaClasses

Where

SQL="SELECT Name, Position,
  REPLACE(REPLACE(REPLACE(Current_Project, CHR(10), ' _ '), CHR(13), ' _ '), CHR(01), ' _ ') AS Current_Project,
  REPLACE(REPLACE(REPLACE(RowIdentifier, CHR(10), ' _ '), CHR(13), ' _ '), CHR(01), ' _ ') AS RowIdentifier
  FROM $SCHEMA.EMPLOYEE"

CHR(10) -- Line feed
CHR(13) -- Carriage return
CHR(01) -- Ctrl-A (the same character used as the field delimiter in the import)

In each affected column, every "problematic" character was replaced, one by one, with a character of our choice, in this case the underscore ( _ ).
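For reference, the --options-file in the command above carries the shared connection settings, one option or value per line. A minimal sketch of such a file is shown below; the JDBC URL, username, and password-file path are hypothetical placeholders, not our actual values.

# $OPTIONS_FILE -- one option or value per line; lines starting with # are comments
--connect
jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL
--username
etl_user
--password-file
hdfs:///user/etl/.oracle.password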

Challenge #2: Handling BLOB & BFILE Columns

Columns of these two data types cannot be imported into HDFS with Sqoop without losing or altering the data. BLOB columns can be imported into HDFS, but the content arrives in a hexadecimal format. BFILE columns are not supported by the Sqoop import tool at all.

The workaround 

As before, we use either a table-based or a query-based import in the Sqoop import tool to load the data into HDFS. In this workaround, we use a query-based import to bring the data in.

As of now, we are skipping any columns with the data types BLOB or BFILE.

As in the first challenge, an explicit select col1, col2, ... was used in the query instead of select *. This let us identify the BLOB and BFILE columns and leave them out of the import. BLOB columns should be possible to handle with an additional data-loading step after the Sqoop import finishes; a sketch of one approach follows the example below.

e.g.
Let EMPLOYEE(Name String, Position String, Current_Project BLOB, Project_Documents BFILE) be the table in question.

Now, the command we executed for sqooping the EMPLOYEE table from Oracle to Hive is shown below:

sqoop import $HADOOP_OPTS \
--options-file $OPTIONS_FILE \
--query "${SQL} where \$CONDITIONS" \
--mapreduce-job-name "${TBL}_LOAD" \
--target-dir $TARGET_DIR/$TABLE_UPPER \
--fields-terminated-by '\001' \
--null-string "" \
--null-non-string "" \
--escaped-by "\\" \
--split-by $SPLIT_BY_COLUMN \
--num-mappers $NUM_MAPPERS \
--outdir $SCRIPT_PATH/javaClasses

Where

SQL="SELECT Name, Position FROM $SCHEMA.EMPLOYEE"
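If the BLOB content does need to be brought across, one possible approach (not one we have productionized) is to have Oracle render the BLOB as a hex string inside the query itself, in line with the hexadecimal behavior noted above. The sketch below assumes the BLOB values fit within the 2000-byte limit that DBMS_LOB.SUBSTR has in a SQL context; larger values would need chunking or a separate post-import load. The hex string could then be decoded downstream, for example with Hive's unhex().

SQL="SELECT Name, Position,
  RAWTOHEX(DBMS_LOB.SUBSTR(Current_Project, 2000, 1)) AS Current_Project_Hex
  FROM $SCHEMA.EMPLOYEE"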

Challenge #3: Using --split-by

We use --split-by and more than one mapper wherever possible to speed up the import process. However, we have found certain data types that are not suitable for --split-by.

To date, the types we have found unsuitable for --split-by are: text, varchar, string, nvarchar, and uniqueidentifier.

The workaround

Our solution is to fall back to a single mapper in such cases. Since our workflows use generic formats for all ingestions, we've written some custom code that adjusts the split-by column and mapper count for each ingestion before Sqoop is triggered.

Hence, if the split-by column provided is of any of the identified data types, we set the mapper count for that table's ingestion to one (1). A rough sketch of this check is shown below.
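A minimal sketch of that pre-Sqoop check, assuming a hypothetical $SPLIT_BY_COLUMN_TYPE variable already holds the split-by column's data type (our actual implementation lives inside a larger ingestion workflow):

# Fall back to a single mapper when the split-by column's type cannot be split on.
NON_SPLITTABLE_TYPES="text varchar string nvarchar uniqueidentifier"

if echo "$NON_SPLITTABLE_TYPES" | grep -qiw "$SPLIT_BY_COLUMN_TYPE"; then
  NUM_MAPPERS=1
  SPLIT_BY_OPTION=""                               # omit --split-by entirely
else
  SPLIT_BY_OPTION="--split-by $SPLIT_BY_COLUMN"
fi

sqoop import $HADOOP_OPTS \
--options-file $OPTIONS_FILE \
--query "${SQL} where \$CONDITIONS" \
$SPLIT_BY_OPTION \
--num-mappers $NUM_MAPPERS \
--target-dir $TARGET_DIR/$TABLE_UPPER

With a single mapper, --split-by can be left out entirely, which is what the empty $SPLIT_BY_OPTION does here.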


These are just a few of the Sqoop issues we’ve discovered elegant workarounds for to date. Subscribe to Zaloni Resources for future posts about workarounds for big data and Hadoop data lakes.

About the Author

Hiranya Deva Sarmah

Senior Software Engineer
