Monday 30 October 2017

Apache DRILL - Troubleshooting ChannelClosedException 


AIM : 

To discuss a few tuning parameters that can help avoid "ChannelClosedException" errors in Drill.
Please note that this is a generic troubleshooting guide.

SYMPTOMS :

The following error messages, reported when a query fails, could be caused by a "ChannelClosedException".
[1] o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: ChannelClosedException: Channel closed /<hostname1>:<port> <--> /<hostname2>:<port2>.
[2] SYSTEM ERROR: Drill Remote Exception

Case 1 : o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: ChannelClosedException: Channel closed /<hostname1>:<port> <--> /<hostname2>:<port2>.
This may be caused by an "OutOfMemory" issue on any one of the drillbit nodes. A more detailed troubleshooting guide and tuning details can be found in my previous blog post:
https://bigdatainourpalm.blogspot.com/2015/10/apache-drill-troubleshooting.html

Case 2 : If you see logs similar to the following, the failure is most likely due to a timeout on the client connection.
(i)  o.a.d.exec.rpc.RpcExceptionHandler - Exception occurred with closed channel.  Connection: /<hostname1>:<port> <--> /<hostname2>:<port2>. (user server)
(ii) o.a.d.exec.rpc.RpcExceptionHandler - Exception occurred with closed channel.  Connection: /<hostname1>:<port> <--> /<hostname2>:<port2>. (user client)

A longer log excerpt is given below:

2015-04-26 04:30:50,430 [27856ea2-2b5c-7908-9e89-aef5eec27034:frag:8:147] INFO  o.a.d.e.w.f.FragmentStatusReporter - 27856ea2-2b5c-7908-9e89-aef5eec27034:8:147: State to report: RUNNING
2015-04-26 04:31:11,825 [UserServer-1] INFO  o.a.drill.exec.rpc.user.UserServer - RPC connection /XXX.YY.ZZ.YY:31010 <--> /XYX.YU.YZ.YY:33320 (user server) timed out.  Timeout was set to 30 seconds. Closing connection.
2015-04-26 04:31:11,839 [27856ea2-2b5c-7908-9e89-aef5eec27034:frag:0:0] INFO  o.a.d.e.w.fragment.FragmentExecutor - 2613806b-9b7b-1d29-3017-dab770388115:0:0: State change requested RUNNING --> FAILED
2015-04-26 04:31:11,841 [UserServer-1] INFO  o.a.d.e.w.fragment.FragmentExecutor - 2667806b-9b7b-1d29-3017-dab770388115:0:0: State change requested FAILED --> FAILED
2015-04-26 04:31:11,841 [27856ea2-2b5c-7908-9e89-aef5eec27034:frag:0:0] INFO  o.a.d.e.w.fragment.FragmentExecutor - 2667806b-9b7b-1d29-3017-dab770388115:0:0: State change requested FAILED --> FAILED

The key here is the 'user server'/'user client' message in the logs, together with the "timed out" line reported by UserServer.
The default timeout value is 30 seconds. 
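
To tell the two cases apart quickly in a large drillbit log, something like the following could help. This is only a minimal sketch, assuming the log lines look like the ones quoted in this post; the function name `classify` and the category labels are made up for illustration and are not part of Drill.

```python
import re

# Patterns are based only on the log messages quoted in this post.
TIMEOUT_RE = re.compile(
    r"RPC connection (?P<conn>\S+ <--> \S+) "
    r"\((?P<side>user server|user client)\) timed out\.\s+"
    r"Timeout was set to (?P<seconds>\d+) seconds"
)
CLOSED_RE = re.compile(
    r"Exception occurred with closed channel\.\s+Connection: .*"
    r"\((?P<side>user server|user client)\)"
)


def classify(line):
    """Return (category, side, timeout_seconds) for a log line, or None.

    The category labels are hypothetical names used for this sketch:
      "rpc-timeout"              -> Case 2, explicit user RPC timeout
      "closed-channel"           -> Case 2, closed user server/client channel
      "channel-closed-exception" -> Case 1, generic ChannelClosedException
    """
    m = TIMEOUT_RE.search(line)
    if m:
        return ("rpc-timeout", m.group("side"), int(m.group("seconds")))
    m = CLOSED_RE.search(line)
    if m:
        return ("closed-channel", m.group("side"), None)
    if "ChannelClosedException" in line:
        return ("channel-closed-exception", None, None)
    return None


def scan(lines):
    """Yield (line_number, classification) for every matching line."""
    for number, line in enumerate(lines, start=1):
        result = classify(line)
        if result is not None:
            yield number, result
```

Calling `scan(open(path))` on a drillbit log (the path depends on your installation) lists the matching lines. If any "rpc-timeout" entries show up, the timeout change described below is worth trying; if only "channel-closed-exception" entries appear, check for memory issues first.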

SOLUTION/WORKAROUND :

Increase the value of the user client timeout. This can be done by setting the 'drill.exec.rpc.user.timeout' property in the 'drill-override.conf' file.
An example is given below (changing the timeout to 300 seconds, i.e. 5 minutes):

cat drill-override.conf

drill.exec: {
  cluster-id: "ajames-drillbits",
  zk.connect: "<hn>:5181",
  rpc.user.timeout: "300"
}

You need to restart the drillbit service on each node after making the change.
