
PySpark - ERROR scheduler.DAGScheduler: Failed to update accumulators

anshuljoshi

I am working with DataFrames, and after multiple joins I start getting errors like this:

17/05/12 07:49:02 ERROR scheduler.DAGScheduler: Failed to update accumulators for ResultTask(6350, 747)
org.apache.spark.SparkException: EOF reached before Python server acknowledged
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:862)
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:820)
        at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:100)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:351)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:346)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.Accumulators$.add(Accumulators.scala:346)
        at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1079)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1151)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Sometimes I don't get the error, but the job just gets stuck.

I am working on sample data (~1000 rows) on a machine with 20 GB of RAM and 8 cores.
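For context, on a single 8-core, 20 GB machine the session is typically launched in local mode using all cores, with driver memory set on the command line. A minimal sketch, assuming Spark 2.x; the master URL, app name, and shuffle-partition count are illustrative values, not settings taken from this post:

    from pyspark.sql import SparkSession

    # Illustrative local setup for an 8-core, ~20 GB machine (assumed values).
    # Driver memory is normally passed at launch time, e.g.:
    #   spark-submit --driver-memory 12g join_debug.py
    spark = (
        SparkSession.builder
        .master("local[8]")                            # one worker thread per core
        .appName("multi-join-debug")
        .config("spark.sql.shuffle.partitions", "16")  # ~1000 rows does not need the default 200
        .getOrCreate()
    )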

The joins are like this:

DF01
│
└───DF02
│   │
│   └───DF03
│       │
└───────DF04
│       │
│       └───DF05
│       │       └───DF06
│       │
│       └───DF07
│       │
│       └───DF08
│       │
│       └───DF09
│       │
└───────DF10
│   │
│   └───

This goes on. It looks like a lineage problem to me. I have tried "coalesce" and "repartition".
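For reference, one common way to cut a long lineage like this is to materialize an intermediate result between joins, either with DataFrame.checkpoint (available from Spark 2.1) or by writing it to disk and reading it back. A minimal sketch, where df01, df02, df03 and the join column "key" are placeholder stand-ins, not names from this post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Checkpointing needs a reliable directory (HDFS on a cluster; a local
    # path is fine on a single machine).
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    # Placeholder DataFrames standing in for DF01, DF02, DF03 above.
    df01 = spark.range(1000).withColumnRenamed("id", "key")
    df02 = spark.range(1000).withColumnRenamed("id", "key")
    df03 = spark.range(1000).withColumnRenamed("id", "key")

    step1 = df01.join(df02, "key")

    # Option 1 (Spark 2.1+): checkpoint materializes the result and truncates
    # its lineage, so later stages do not drag the whole join tree along.
    step1 = step1.checkpoint(eager=True)

    # Option 2 (any version): write the intermediate result out and read it
    # back, which also breaks the lineage at the cost of a round trip to disk.
    step1.write.mode("overwrite").parquet("/tmp/step1.parquet")
    step1 = spark.read.parquet("/tmp/step1.parquet")

    # Continue joining against the materialized result.
    step2 = step1.join(df03, "key")

Either approach keeps each stage's plan small instead of letting the full join tree accumulate.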

I also went through this answer on Spark lineage.

Any solutions or suggestions are welcome.

http://stackoverflow.com/questions/44017495/pyspark-failed-to-update-accumulators