This works very well in all the Scala-based code (spark2-shell,
spark-submit, etc.), but it doesn't for PySpark, where I see the
following warnings repeating over and over:
20/01/14 10:23:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
Attempted to request executors before the AM has registered!
20/01/14 10:23:50 WARN ExecutorAllocationManager: Unable to reach the
cluster manager to request 1 total executors!
The culprit seems to be the option "spark.io.encryption.enabled=true",
since without it everything works fine.
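For reference, here is roughly how the session gets created (a minimal
sketch of our setup, not the exact job; dynamic allocation is an
assumption on my side based on the ExecutorAllocationManager warnings
above):

from pyspark.sql import SparkSession

# Minimal repro sketch: YARN client mode with I/O encryption enabled.
# Dynamic allocation is assumed here because of the
# ExecutorAllocationManager warnings; dropping the
# spark.io.encryption.enabled line makes the problem disappear.
spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.submit.deployMode", "client")
    .config("spark.io.encryption.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)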
At first I thought it was a YARN resource allocation problem, but
then I checked and the cluster has plenty of capacity. After digging a
bit more into YARN's container logs, I discovered that it seems to be
a problem with the Application Master not being able to contact the
driver in time (we run in client mode, of course):
20/01/14 09:45:21 INFO ApplicationMaster: ApplicationAttemptId:
20/01/14 09:45:21 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 09:45:52 ERROR TransportClientFactory: Exception while
bootstrapping client after 30120 ms
Timeout waiting for task.
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
... 11 more
The strange part is that the timeout occurs only sometimes. I checked
the code related to the above stacktrace and ended up in the
TransportClientFactory client bootstrap.
The "spark.network.auth.rpcTimeout" option seems to help, even if it
is not advertised in the docs as far as I can see (the 30s mentioned
in the exception are definitely trigger by this setting though). What
I am wondering is where/what I should check to debug this further,
since it seems a Python only problem that doesn't affect Scala. I
didn't find any outstanding bugs, so given the fact that 2.4.4 is very
recent I thought to report it in here to ask for an advice :)
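In case it helps others, this is the workaround I'm testing (again a
sketch; the 120s value is an arbitrary guess on my side, not a tuned
recommendation):

from pyspark.sql import SparkSession

# Workaround sketch: raise the (apparently undocumented) auth RPC
# timeout that seems to produce the ~30s bootstrap failure above.
# The 120s value is an assumption, not a tested recommendation.
spark = (
    SparkSession.builder
    .config("spark.io.encryption.enabled", "true")
    .config("spark.network.auth.rpcTimeout", "120s")
    .getOrCreate()
)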
Thanks in advance!