Thursday 15 February 2018

Sqoop export job getting stuck and failing on task timeout


AIM:


Diagnose an issue where a user's Sqoop job was stuck for a long time and eventually failed.

Details:

In this particular case, the user was trying to export data from a Hive table to Oracle using Sqoop.
The job was getting stuck and eventually failing after 40+ minutes.
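
For context, the export command in such a scenario is roughly of the shape below. This is a sketch only: the connection string, credentials, and table names are placeholders, not values from the actual job.

        # Illustrative Sqoop export from a Hive table to Oracle;
        # all names and the JDBC URL are placeholders.
        sqoop export \
            --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
            --username dbuser \
            --table TARGET_TABLE \
            --hcatalog-database default \
            --hcatalog-table source_table

The --hcatalog-* options read directly from a Hive table; the same failure mode applies when exporting an HDFS directory with --export-dir.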

Analyze the application log for the failed application.
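
Assuming YARN log aggregation is enabled, the aggregated log can be fetched with the YARN CLI; the application ID below is a placeholder:

        # Fetch the aggregated log and scan it for errors.
        yarn logs -applicationId application_1518000000000_0001 > app.log
        grep -i "ERROR" app.log

For this case, the following exception was reported in the log: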

2018-02-14 22:02:03,731 ERROR [Thread-7] org.apache.sqoop.mapreduce.AsyncSqlOutputFormat: Got exception in update thread: java.sql.SQLIntegrityConstraintViolationException: ORA-00001: unique constraint (ALWIN.SYS_C009282) violated

        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:440)
        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
        at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:837)
        at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:445)
        at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
        at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:523)
        at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
        at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:1010)
        at oracle.jdbc.driver.OraclePreparedStatement.executeForRowsWithTimeout(OraclePreparedStatement.java:10096)
        at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10200)
        at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:216)
        at org.apache.sqoop.mapreduce.AsyncSqlOutputFormat$AsyncSqlExecThread.run(AsyncSqlOutputFormat.java:231)

The export failed because the source data contained duplicate values for a column with a UNIQUE constraint in Oracle.
In general, an export will fail if the data violates any constraint defined on the destination table.
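
One quick way to confirm this from the Hive side is to check the source table for duplicate key values. The table and column names here are assumptions for illustration:

        # Any row returned identifies a key value that would violate
        # the unique constraint on export.
        hive -e "
        SELECT   id, COUNT(*) AS cnt
        FROM     source_table
        GROUP BY id
        HAVING   COUNT(*) > 1;"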

The task is supposed to fail as soon as the above exception is encountered. Instead, the task went into an unresponsive state. An unresponsive task is killed only after the task timeout is reached, which defaults to 10 minutes. After the first attempt is killed, the ApplicationMaster retries the task three more times before the application is reported as failed. With each of the four attempts taking 10 minutes (by default) to time out, the job runs for roughly 40 minutes before failing.
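
The two settings behind this 40-minute window are standard MapReduce properties and can be overridden per job. A sketch, with both values at their defaults and $JDBC_URL plus the table names standing in for the real job's options:

        # mapreduce.task.timeout: kill an unresponsive task after 600000 ms (10 min)
        # mapreduce.map.maxattempts: 4 attempts in total before the job fails
        sqoop export \
            -Dmapreduce.task.timeout=600000 \
            -Dmapreduce.map.maxattempts=4 \
            --connect "$JDBC_URL" --table TARGET_TABLE --hcatalog-table source_table

Lowering either value only shortens the stall; it does not address the underlying hang.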

Root Cause:

The root cause is a bug in Sqoop. In the AsyncSqlRecordWriter class, if an exception is thrown from its close method, the Hadoop MapTask calls close once more during cleanup in case the writer has not been closed, and that second call hangs rather than failing, which is what left the task unresponsive until the timeout. The following JIRA contains more information:
https://issues.apache.org/jira/browse/SQOOP-2343

Solution:

Although the hang itself is a Sqoop bug that is fixed in Sqoop 1.4.7, fixing it only makes the job fail fast; ultimately the source data needs to be cleansed so that it is compatible with the constraints specified in the destination database. In this particular case, duplicate values for the specific column had to be filtered out of the source table.
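A hedged sketch of such a cleanup in Hive, keeping one row per key value; the table and column names are illustrative only:

        # Write a deduplicated copy of the source table, keeping one
        # row per id via ROW_NUMBER().
        hive -e "
        INSERT OVERWRITE TABLE source_table_dedup
        SELECT id, col1, col2
        FROM (
            SELECT id, col1, col2,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
            FROM   source_table
        ) t
        WHERE rn = 1;"

Re-running the export against the deduplicated table should then complete without violating the unique constraint.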
