Quantcast

Why is shuffle data always persisted to disk?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Why is shuffle data always persisted to disk?

Effi Ofer
Greetings, I was wondering why Spark's Shuffler always persists the shuffle data to disk?  I understand that the persisted data can be used by the scheduler to truncate the lineage of the RDD graph if an existing RDD has been materialized as a side effect of an earlier shuffle.  But that does not explain why Spark is not keeping the shuffle RDD in memory until memory becomes sufficiently low to trigger victim selection and spilling.  Any hints and pointers would be appreciated.

Thanks,
Effi

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Why is shuffle data always persisted to disk?

Kay Ousterhout
There's been some discussion of this on this JIRA and the associated PR.  The short summary is that, in theory / to a VERY rough approximation, the OS buffer cache does everything we'd want an in-memory shuffle to do, and is simple.

On Wed, Mar 29, 2017 at 3:19 AM, Effi Ofer <[hidden email]> wrote:
Greetings, I was wondering why Spark's Shuffler always persists the shuffle data to disk?  I understand that the persisted data can be used by the scheduler to truncate the lineage of the RDD graph if an existing RDD has been materialized as a side effect of an earlier shuffle.  But that does not explain why Spark is not keeping the shuffle RDD in memory until memory becomes sufficiently low to trigger victim selection and spilling.  Any hints and pointers would be appreciated.

Thanks,
Effi


Loading...