Query /Bug Spark Streaming / Context Cleaner/ GC question


Tarun

Hi spark-community,

 

Can someone please advise on the question below, related to a Spark Streaming query / ContextCleaner / garbage collection issue we are facing? We suspect a bug is causing a memory leak.

 

We have a Spark 2.3 cluster running a streaming query. We observe that no matter how much memory we allocate to the executors, the JVM heap eventually grows to its limit and GC starts to cause frequent timeouts; eventually the executor is marked "lost" or "dead". GC logging is enabled, and it takes about 30-45 minutes to fill the heap, after which full GCs become much more frequent. We have tried increasing executor memory, the GC interval, and other relevant memory parameters, but we keep observing the same issue.
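For reference, these are roughly the kinds of knobs we have been tuning (illustrative values, not our exact production settings):

```shell
# Illustrative sketch of the tuning we tried (values are examples only)
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.cleaner.periodicGC.interval=10min \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails" \
  ...
```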

 

We enabled ContextCleaner debug logs and observe only broadcast/accumulator cleaning messages. We never see RDDs being received for cleanup, i.e. no "Cleaning RDD ..." messages (Ref: ContextCleaner.scala#L213). I have attached the context cleaner logs for reference as well.
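For completeness, this is roughly how the debug logging was enabled (Spark 2.x ships log4j 1.x, configured via log4j.properties):

```shell
# Enable ContextCleaner debug output in log4j.properties on the driver
echo "log4j.logger.org.apache.spark.ContextCleaner=DEBUG" >> conf/log4j.properties

# The check we ran: only broadcast/accumulator lines show up, never "Cleaning RDD"
grep "Cleaning RDD" driver.log   # returns no matches in our case
```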

 

 

2020-09-14 20:00:12,270: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 538

2020-09-14 20:00:12,271: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanBroadcast(540)

2020-09-14 20:00:12,271: DEBUG [org.apache.spark.ContextCleaner] Cleaning broadcast 540

2020-09-14 20:00:21,915: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 540

2020-09-14 20:00:21,915: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanBroadcast(536)

2020-09-14 20:00:21,915: DEBUG [org.apache.spark.ContextCleaner] Cleaning broadcast 536

2020-09-14 20:00:21,922: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 536

2020-09-14 20:00:21,922: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanBroadcast(537)

2020-09-14 20:00:21,922: DEBUG [org.apache.spark.ContextCleaner] Cleaning broadcast 537

2020-09-14 20:00:21,926: DEBUG [org.apache.spark.ContextCleaner] Cleaned broadcast 537

2020-09-14 20:00:21,926: DEBUG [org.apache.spark.ContextCleaner] Got cleaning task CleanAccum(14783)

2020-09-14 20:00:21,926: DEBUG [org.apache.spark.ContextCleaner] Cleaning accumulator 14783

 

We also see plenty of executor storage memory still available, as shown in the attached screenshot.

 

Any inputs or suggestions would be much appreciated!

 

Thanks

Tarun

  



---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

debug_log.txt (430K) Download Attachment
executors.png (192K) Download Attachment
info_log.txt (432K) Download Attachment