[DISCUSS] Spark cannot identify the problem executor


Hello all,

We've been using Spark 2.3 with blacklisting enabled, and we often hit a problem where executor A develops an issue (such as a connection failure), but tasks on executors B and C are the ones that fail, reporting that they cannot read from executor A. Eventually the job fails because a task on executor B has failed 4 times.

I wonder whether there is an existing fix, or any discussion of how to identify executor A as the problem node.
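For context, the relevant settings are roughly the following (a sketch; the values shown are the Spark 2.3 defaults, not necessarily our exact configuration):

```properties
# Blacklisting is off by default; we enable it explicitly
spark.blacklist.enabled=true
# A job fails once any single task has failed this many times (default: 4),
# which is why a fetch failure repeated on executor B can kill the job
spark.task.maxFailures=4
```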


Re: [DISCUSS] Spark cannot identify the problem executor


There is an existing mechanism for this situation. Those tasks become
zombie tasks [1], and they should not be counted toward the task failure
limit [2]. The shuffle blocks should also be unregistered for the lost
executor, although the lost executor's map output may still be cached on the
other executors [3], which can generate new fetch failures.

Check the code parts mentioned above and run Spark with DEBUG logging
enabled for these classes to investigate further. Reading the logs and the
code side by side will help you a lot. Also consider using a recent Spark
release, as there have been changes in this area.
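For example, DEBUG logging can be enabled for the scheduler classes involved via log4j.properties (class names are from the Spark 2.x codebase; adjust the list to match whatever classes the referenced code points you at):

```properties
# Task failure accounting and blacklisting decisions
log4j.logger.org.apache.spark.scheduler.TaskSetManager=DEBUG
log4j.logger.org.apache.spark.scheduler.TaskSchedulerImpl=DEBUG
log4j.logger.org.apache.spark.scheduler.BlacklistTracker=DEBUG
# Map output registration/unregistration for lost executors
log4j.logger.org.apache.spark.MapOutputTracker=DEBUG
log4j.logger.org.apache.spark.MapOutputTrackerMaster=DEBUG
```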

Important: You can avoid this problem altogether by using the external
shuffle service.
If you happen to be on YARN, please check this link [4].

When the external shuffle service is enabled, shuffle blocks are not lost
when an executor dies, because the blocks can still be served by the shuffle
service running on the same host where the executor was.
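Concretely, the setup looks roughly like this (property and class names per the Spark and YARN documentation; a sketch, not a complete deployment guide):

```properties
# spark-defaults.conf
spark.shuffle.service.enabled=true

# On YARN, each NodeManager must also run the Spark shuffle service as an
# auxiliary service, configured in yarn-site.xml:
#   yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class =
#       org.apache.spark.network.yarn.YarnShuffleService
```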

Best Regards,

Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
