Apache DRILL - Troubleshooting ChannelClosedException
AIM :
To discuss a few tuning parameters that can help avoid "ChannelClosedException" errors in Drill.
Please note that this is a generic troubleshooting guide.
SYMPTOMS :
The following error messages, reported when a query fails, could be caused by a "ChannelClosedException".
[1] o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: ChannelClosedException: Channel closed /<hostname1>:<port> <--> /<hostname2>:<port2>.
[2] SYSTEM ERROR: Drill Remote Exception
Case 1 : o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: ChannelClosedException: Channel closed /<hostname1>:<port> <--> /<hostname2>:<port2>.
This may be caused by an "OutOfMemory" issue on any one of the drillbit nodes. A more detailed troubleshooting guide and tuning details can be found in my previous blog post:
https://bigdatainourpalm.blogspot.com/2015/10/apache-drill-troubleshooting.html
Case 2 : If you see logs similar to the following, the failure is most likely due to a timeout on the user (client) connection.
(i) o.a.d.exec.rpc.RpcExceptionHandler - Exception occurred with closed channel. Connection: /<hostname1>:<port> <--> /<hostname2>:<port2>. (user server)
(ii) o.a.d.exec.rpc.RpcExceptionHandler - Exception occurred with closed channel. Connection: /<hostname1>:<port> <--> /<hostname2>:<port2>. (user client)
A longer excerpt from the drillbit log is given below:
2015-04-26 04:30:50,430 [27856ea2-2b5c-7908-9e89-aef5eec27034:frag:8:147] INFO o.a.d.e.w.f.FragmentStatusReporter - 27856ea2-2b5c-7908-9e89-aef5eec27034:8:147: State to report: RUNNING
2015-04-26 04:31:11,825 [UserServer-1] INFO o.a.drill.exec.rpc.user.UserServer - RPC connection /XXX.YY.ZZ.YY:31010 <--> /XYX.YU.YZ.YY:33320 (user server) timed out. Timeout was set to 30 seconds. Closing connection.
2015-04-26 04:31:11,839 [27856ea2-2b5c-7908-9e89-aef5eec27034:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2613806b-9b7b-1d29-3017-dab770388115:0:0: State change requested RUNNING --> FAILED
2015-04-26 04:31:11,841 [UserServer-1] INFO o.a.d.e.w.fragment.FragmentExecutor - 2667806b-9b7b-1d29-3017-dab770388115:0:0: State change requested FAILED --> FAILED
2015-04-26 04:31:11,841 [27856ea2-2b5c-7908-9e89-aef5eec27034:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2667806b-9b7b-1d29-3017-dab770388115:0:0: State change requested FAILED --> FAILED
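The key indicator here is the '(user server)' or '(user client)' tag in the log messages, which shows that the closed channel was the connection between the client and the drillbit.
The default timeout value is 30 seconds, which matches the "Timeout was set to 30 seconds" message in the log above.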
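To tell the two cases apart quickly when scanning a large drillbit log, a small script can help. The following is only a minimal sketch in Python: the regular expressions are derived from the log excerpts in this post, the function name `classify_line` and the case labels are hypothetical, and the exact message wording may differ between Drill versions.

```python
import re

# Patterns derived from the log excerpts in this post (assumption: the
# wording is the same in your Drill version).
TIMEOUT = re.compile(
    r"\((user server|user client)\) timed out\. "
    r"Timeout was set to (\d+) seconds"
)
RPC_CLOSED = re.compile(
    r"RpcExceptionHandler - Exception occurred with closed channel"
    r".*\((user server|user client)\)"
)
FRAGMENT_CLOSED = re.compile(
    r"FragmentExecutor - SYSTEM ERROR: ChannelClosedException"
)


def classify_line(line):
    """Return (label, side, timeout_seconds) for a matching line, else None.

    The labels are hypothetical names for the two cases in this post.
    """
    match = TIMEOUT.search(line)
    if match:
        # Case 2: the user connection hit the RPC timeout.
        return ("case2-timeout", match.group(1), int(match.group(2)))
    match = RPC_CLOSED.search(line)
    if match:
        # Case 2: closed channel reported on the user server/client side.
        return ("case2-closed-channel", match.group(1), None)
    if FRAGMENT_CLOSED.search(line):
        # Case 1: possibly an OutOfMemory issue on a drillbit node.
        return ("case1-possible-oom", None, None)
    return None


def scan(lines):
    """Collect the classifications of all matching lines."""
    return [hit for hit in map(classify_line, lines) if hit is not None]
```

For example, passing the lines of the drillbit log to `scan(open(path))` returns the list of matches; a "case2-timeout" entry also carries the timeout value that was in effect when the connection was closed.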
SOLUTION/WORKAROUND :
Increase the value of the user client timeout. This can be done by setting the 'drill.exec.rpc.user.timeout' property in the 'drill-override.conf' file.
An example is given below (changing the timeout to 300 seconds, i.e. 5 minutes):
cat drill-override.conf
drill.exec: {
  cluster-id: "ajames-drillbits",
  zk.connect: "<hn>:5181",
  rpc.user.timeout: "300"
}
You need to restart drillbit services after making the change.