How to clear Spark shuffle files


How to clear Spark shuffle files

lsn248
Hi,

I have a long-running application, and Spark seems to fill up the disk with
shuffle files. Eventually the job fails after running out of disk space. Is
there a way for me to clean up the shuffle files?

Thanks


Re: How to clear Spark shuffle files

edeesis
We've also had similar disk-fill issues.

For Java/Scala RDDs, shuffle-file cleanup is done as part of JVM garbage collection. I've noticed that if RDDs remain referenced in the code and cannot be garbage collected, then the intermediate shuffle files hang around.

The best way to handle this is to organize your code so that when an RDD is finished, it falls out of scope and can therefore be garbage collected.
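
A minimal sketch of that scoping pattern (the input path here is made up for illustration):

    import org.apache.spark.sql.SparkSession

    object ScopedShuffleExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("scoped-shuffle").getOrCreate()
        val sc = spark.sparkContext

        // Keep shuffle-producing RDDs local to a method: once it returns,
        // nothing references them, so a later JVM GC lets Spark's
        // ContextCleaner remove the corresponding shuffle files from disk.
        def wordCount(path: String): Long = {
          val counts = sc.textFile(path)        // hypothetical input path
            .flatMap(_.split("\\s+"))
            .map(w => (w, 1L))
            .reduceByKey(_ + _)                 // produces shuffle files
          counts.count()                        // `counts` is unreachable after return
        }

        println(wordCount("/data/input.txt"))
        spark.stop()
      }
    }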

There's also an experimental API, added in Spark 3 (I think), that gives you more granular control: you call a method to clean up an RDD's shuffle files.
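
The method isn't named in the thread; assuming it's the @Experimental RDD.cleanShuffleDependencies(blocking) call that was landing for Spark 3.1 around this time, a hedged sketch, reusing the `spark` session from the sketch above:

    // Assumes Spark 3.1+, where RDD.cleanShuffleDependencies(blocking) exists
    // as an @Experimental method; verify against your version's RDD scaladoc.
    val counts = spark.sparkContext
      .textFile("/data/input.txt")              // hypothetical input path
      .map(w => (w, 1L))
      .reduceByKey(_ + _)                       // creates shuffle files

    counts.count()                              // materialize the shuffle
    counts.cleanShuffleDependencies(blocking = true)  // remove this RDD's shuffle files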



Re: How to clear Spark shuffle files

Holden Karau
There's also a second new mechanism that uses a TTL for cleanup of shuffle files. Can you share more about your use case?
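
The TTL mechanism isn't named here. Purely as an illustration, one related setting that does exist is spark.cleaner.periodicGC.interval, which forces a periodic driver GC so the ContextCleaner can reap shuffle files whose RDDs are already unreferenced (this may well not be the mechanism meant):

    import org.apache.spark.sql.SparkSession

    // Illustration only: spark.cleaner.periodicGC.interval (default 30min)
    // makes the driver trigger GC on a schedule, so unreferenced shuffle
    // files are cleaned sooner. This is GC-driven, not a true per-file TTL.
    val spark = SparkSession.builder()
      .appName("periodic-gc-cleanup")
      .config("spark.cleaner.periodicGC.interval", "10min")
      .getOrCreate()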


Re: How to clear Spark shuffle files

lsn248
Our use case is as follows: we repartition six months' worth of data for
each client on clientId and recordcreationdate, so that we can write one file
per partition. The output is partitioned on client and recordcreationdate.

The job fills up the disk after it processes, say, 30 tenants out of 50. I am
looking for a way to clear the shuffle files once the job finishes writing to
disk for a client, before it moves on to the next one.

We process a client or a group of clients (depending on data size) in one go,
and the SparkSession is shared. We noticed that creating a new SparkSession
clears the disk, but a new SparkSession is not an option for us.
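
Putting the earlier suggestions together, a hedged sketch of what such a per-client loop could look like (the client list, paths, and column names are placeholders, not the real job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PerClientJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("per-client-repartition").getOrCreate()

        val clients = Seq("client01", "client02")   // placeholder client list

        clients.foreach { clientId =>
          // All references stay local to this closure, so after each client
          // the DataFrame is unreachable and GC-driven cleanup can reclaim
          // its shuffle files without starting a new SparkSession.
          val df = spark.read
            .parquet(s"/data/$clientId")                             // hypothetical layout
            .repartition(col("clientId"), col("recordcreationdate")) // the shuffle in question
          df.write
            .partitionBy("clientId", "recordcreationdate")
            .mode("overwrite")
            .parquet(s"/out/$clientId")
          // Optional nudge between clients (an assumption, not a guarantee):
          // ask the JVM to GC so the ContextCleaner notices the dropped refs.
          System.gc()
        }
        spark.stop()
      }
    }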


