Why is spark.shuffle.sort.bypassMergeThreshold 200?


Why is spark.shuffle.sort.bypassMergeThreshold 200?

Jacek Laskowski
Hi,

I'm wondering what's so special about 200 that it became the default value
of spark.shuffle.sort.bypassMergeThreshold?

Is this an arbitrary number? Is there any theory behind it?

Is the number of partitions in Spark SQL, i.e. 200, somehow related to
spark.shuffle.sort.bypassMergeThreshold?

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res3: Int = 200

I'd appreciate any guidance to get the gist of this seemingly magic
number. Thanks!
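
For reference, here's how I check both values from the shell (a quick
sketch; spark.shuffle.sort.bypassMergeThreshold usually isn't set
explicitly, so I read it from the SparkConf with its documented default):

scala> spark.conf.get("spark.sql.shuffle.partitions")
res4: String = 200

scala> spark.sparkContext.getConf.get("spark.shuffle.sort.bypassMergeThreshold", "200")
res5: String = 200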

Best regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

Liang-Chi Hsieh

This https://github.com/apache/spark/pull/1799 seems to be the first PR to introduce this number, but it offers no explanation for the choice.
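
For context, the threshold gates whether the sort-based shuffle can take the bypass-merge path, which writes one file per reduce partition and simply concatenates them instead of sorting. A rough sketch of the check (simplified from SortShuffleWriter.shouldBypassMergeSort; treat the exact shape as an approximation of the real code):

def shouldBypassMergeSort(
    bypassMergeThreshold: Int,  // spark.shuffle.sort.bypassMergeThreshold
    numPartitions: Int,         // number of reduce-side partitions
    mapSideCombine: Boolean): Boolean = {
  // Map-side aggregation needs the in-memory sorter, so it can never
  // take the bypass path; otherwise the bypass only pays off when there
  // are few enough partitions to keep one open file per partition.
  !mapSideCombine && numPartitions <= bypassMergeThreshold
}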


Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

Kay Ousterhout
I believe that these two were indeed originally related. In the old hash-based shuffle, we wrote objects out to disk immediately as they were generated by an RDD's iterator. With the original version of the new sort-based shuffle, on the other hand, Spark buffered a bunch of objects before writing them out to disk.

My vague memory is that this caused issues for Spark SQL -- I think because SQL got a performance improvement from re-using the same objects when generating data from the iterator, but if it re-used objects, the sort-based shuffle didn't work, because all of the buffered objects would incorrectly point to the same underlying object. So, the default was set to 200 so that SQL wouldn't use the sort-based shuffle. My memory is that the issues around this have since been fixed, but Michael / Reynold / Andrew Or probably have a better memory of this.
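
To make the re-use problem concrete, here's a toy sketch (plain Scala, not Spark's actual shuffle code):

// A mutable record that an iterator re-uses for performance.
class Record(var value: Int)

val shared = new Record(0)
val reusingIter = Iterator.tabulate(3) { i => shared.value = i; shared }

// Writing each element out as it arrives (hash-based shuffle) sees
// 0, 1, 2 -- but buffering references first (early sort-based shuffle)
// leaves three pointers to the same object:
val buffered = reusingIter.toArray
buffered.map(_.value)  // Array(2, 2, 2), not Array(0, 1, 2)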

-Kay

On Wed, Dec 28, 2016 at 7:05 PM, Liang-Chi Hsieh <[hidden email]> wrote:

This https://github.com/apache/spark/pull/1799 seems to be the first PR to
introduce this number, but it offers no explanation for the choice.

