[DISCUSS] Spark cannot identify the problem executor
We've been using Spark 2.3 with the blacklist feature enabled, and we often hit a problem where, when executor A has an issue (such as a connection failure), tasks on executors B and C fail because they cannot read shuffle data from executor A. Eventually the job fails because a task on executor B has failed 4 times.
I wonder whether there is an existing fix, or any prior discussion, about identifying executor A as the problem node.
Re: [DISCUSS] Spark cannot identify the problem executor
There is an existing mechanism to handle this situation. Those tasks become
zombie tasks, and they should not be counted toward the task failure limit.
The shuffle blocks should also be unregistered for the lost executor,
although the lost executor's map output might already be cached on the
other executors, which can generate new fetch failures.
Check the mentioned code parts and run Spark with debug logging enabled for
these classes to investigate this further. Reading the log alongside the
code will help you a lot. Also consider using a recent Spark version, as
there have been changes in this area.
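As a sketch of how to turn on that debug logging, you can raise the log level for the scheduler classes in log4j.properties. The class list below is illustrative (these are real Spark 2.x scheduler classes, but pick the ones matching the code path you are tracing):

```properties
# Illustrative: DEBUG logging for the classes involved in task retries
# and fetch-failure handling; adjust to the code paths you are tracing.
log4j.logger.org.apache.spark.scheduler.TaskSetManager=DEBUG
log4j.logger.org.apache.spark.scheduler.DAGScheduler=DEBUG
log4j.logger.org.apache.spark.MapOutputTracker=DEBUG
```

With this in place, the driver log will show how fetch failures are attributed and when map output for the lost executor is unregistered.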
Important: you can avoid this problem altogether by using the external
shuffle service. If you happen to be on YARN, please check this link.
When the external shuffle service is enabled, shuffle blocks are not lost
with the dying executor, because the blocks can be served by the shuffle
service, which runs on the same host where the executor was.
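For reference, enabling the external shuffle service on YARN involves configuration on both sides. This is a sketch; the property names are the standard Spark/YARN ones, but verify them against your Spark version's documentation:

```properties
# spark-defaults.conf (application side): have executors register their
# shuffle files with the external shuffle service instead of serving them
# directly, so the files survive executor loss.
spark.shuffle.service.enabled    true

# yarn-site.xml (each NodeManager) additionally needs the Spark shuffle
# service registered as an auxiliary service, along the lines of:
#   yarn.nodemanager.aux-services                     spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class org.apache.spark.network.yarn.YarnShuffleService
```

The NodeManagers must also have the Spark YARN shuffle jar on their classpath and be restarted for the auxiliary service to take effect.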