better compression codecs for shuffle blocks?

10 messages
better compression codecs for shuffle blocks? – Hi Spark devs, I was looking into the memory usage of shuffle and one annoying thing with the default compression codec (LZF) is that the implem...
We tried with a lower block size for lzf, but it barfed all over the place. Snappy was the way to go for our jobs. Regards, Mridul On Mo...
Just a comment from the peanut gallery, but these buffers are a real PITA for us as well. Probably 75% of our non-user-error job failures are ...
Stephen, Often the shuffle is bound by writes to disk, so even if disks have enough space to store the uncompressed data, the shuffle can comple...
You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out, there will still be some buffers for the v...
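(For anyone following along, a minimal sketch of setting that from application code, assuming it is applied to the SparkConf before the context is created; the app name is just a placeholder:)

    import org.apache.spark.{SparkConf, SparkContext}

    // Disable compression of shuffle outputs (spark.shuffle.compress is on by default).
    val conf = new SparkConf()
      .setAppName("shuffle-compression-test")   // placeholder name
      .set("spark.shuffle.compress", "false")
    val sc = new SparkContext(conf)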
Copying Jon here since he worked on the lzf library at Ning. Jon - any comments on this topic? On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaha...
Maybe we could try LZ4 [1], which has better performance and smaller footprint than LZF and Snappy. In fast scan mode, the performance is 1.5 - 2...
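(A hedged sketch of experimenting with a different codec: the codec is chosen via spark.io.compression.codec, where "lzf" and "snappy" are the long-standing values and "lz4" only works on builds that include LZ4 support:)

    import org.apache.spark.SparkConf

    // Pick the block compression codec; accepted values depend on the Spark build
    // ("lzf", "snappy", and "lz4" where LZ4 support has been merged).
    val conf = new SparkConf()
      .set("spark.io.compression.codec", "lz4")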
Is the held memory due to just instantiating the LZFOutputStream? If so, I'm surprised and I consider that a bug. I suspect the held memory...
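(One way to probe that question, as a rough sketch assuming the ning-compress LZFOutputStream discussed above; the instance count is arbitrary:)

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream

    // Rough heap probe: construct many idle LZF streams and see how much memory
    // they retain before any bytes are written.
    val rt = Runtime.getRuntime
    def usedMB(): Long = (rt.totalMemory - rt.freeMemory) / (1024 * 1024)

    System.gc(); val before = usedMB()
    val streams = (1 to 10000).map(_ => new LZFOutputStream(new ByteArrayOutputStream()))
    System.gc(); val after = usedMB()
    println(s"~${after - before} MB held by ${streams.size} idle streams")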
One of the core problems here is the number of open streams we have, which is (# cores * # reduce partitions), which can easily climb into the te...
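(To put rough numbers on that, an illustrative back-of-the-envelope only; the per-stream buffer size is an assumption and varies by codec and version:)

    // Back-of-the-envelope: open compression streams during a shuffle.
    val cores = 16                 // assumed executor cores per node
    val reducePartitions = 2000    // assumed number of reduce partitions
    val bufferBytes = 64 * 1024    // assumed per-stream compression buffer

    val openStreams = cores * reducePartitions                         // 32,000 streams
    val bufferMB = openStreams.toLong * bufferBytes / (1024 * 1024)
    println(s"$openStreams streams, ~$bufferMB MB of buffers per node")  // ~2000 MB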
FYI dev, I submitted a PR making Snappy as the default compression codec: https://github.com/apache/spark/pull/1415 Also submitted a separa...