Fwd: Spark 2.4.4, RPC encryption and Python


Luca Toscano
Hi everybody,

I'm forwarding the same question to dev@, since some info about how to
debug this would be helpful :)

Thanks in advance,

Luca

---------- Forwarded message ---------
From: Luca Toscano <[hidden email]>
Date: Thu 16 Jan 2020 at 09:16
Subject: Spark 2.4.4, RPC encryption and Python
To: <[hidden email]>


Hi everybody,

I am currently testing Spark 2.4.4 with the following new settings:

spark.authenticate   true
spark.io.encryption.enabled   true
spark.io.encryption.keySizeBits   256
spark.io.encryption.keygen.algorithm   HmacSHA256
spark.network.crypto.enabled   true
spark.network.crypto.keyFactoryAlgorithm   PBKDF2WithHmacSHA256
spark.network.crypto.keyLength   256
spark.network.crypto.saslFallback   false
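
For completeness, the same settings can also be passed on the command
line when launching a PySpark session in client mode (equivalent to the
spark-defaults.conf entries above, shown here just to make the test
setup easier to reproduce):

  pyspark \
    --master yarn --deploy-mode client \
    --conf spark.authenticate=true \
    --conf spark.io.encryption.enabled=true \
    --conf spark.network.crypto.enabled=true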

I use dynamic allocation and the Spark shuffle service is configured
correctly in YARN. I added the following two properties to yarn-site.xml:

  <property>
      <name>spark.authenticate</name>
      <value>true</value>
  </property>

  <property>
      <name>spark.network.crypto.enabled</name>
      <value>true</value>
  </property>

This works very well for all the Scala-based entry points (spark2-shell,
spark-submit, etc.), but it doesn't for PySpark, where I see the
following warnings repeating over and over:

20/01/14 10:23:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
Attempted to request executors before the AM has registered!
20/01/14 10:23:50 WARN ExecutorAllocationManager: Unable to reach the
cluster manager to request 1 total executors!

The culprit seems to be the option "spark.io.encryption.enabled=true":
without it, everything works fine.
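
This is easy to bisect from the shell: launching the very same PySpark
session with only that one option flipped makes the warnings appear or
disappear (flags abbreviated for readability):

  # reproduces the warnings above
  pyspark --conf spark.authenticate=true \
          --conf spark.network.crypto.enabled=true \
          --conf spark.io.encryption.enabled=true

  # works fine
  pyspark --conf spark.authenticate=true \
          --conf spark.network.crypto.enabled=true \
          --conf spark.io.encryption.enabled=false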

At first I thought that it was a YARN resource allocation problem, but
then I checked and the cluster has plenty of space. After digging a
bit more into YARN's container logs, I discovered that the problem
seems to be the ApplicationMaster not being able to contact the driver
in time (this is client mode, of course):

20/01/14 09:45:21 INFO ApplicationMaster: ApplicationAttemptId:
appattempt_1576771377404_19608_000001
20/01/14 09:45:21 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 09:45:52 ERROR TransportClientFactory: Exception while
bootstrapping client after 30120 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException:
Timeout waiting for task.
        at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:263)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:257)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
        at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:259)
        ... 11 more

The strange part is that the timeout occurs only intermittently:
sometimes it does, sometimes it doesn't. I checked the code related to
the above stacktrace and ended up at:

https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java#L106
https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L129-L133

The "spark.network.auth.rpcTimeout" option seems to help, even if it
is not advertised in the docs as far as I can see (the 30s mentioned
in the exception are definitely triggered by this setting though).
What I am wondering is where/what I should check to debug this
further, since it seems a Python-only problem that doesn't affect
Scala. I didn't find any outstanding bugs, so given the fact that
2.4.4 is very recent I thought to report it here and ask for advice :)
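
For reference, the workaround I am currently testing is simply raising
that timeout in spark-defaults.conf (60s is an arbitrary value picked
for the test, not a recommendation):

  spark.network.auth.rpcTimeout   60s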

Thanks in advance!

Luca

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]