Spark 3.0 and ORC 1.6


Spark 3.0 and ORC 1.6

David Christle

Hi all,

 

I am a heavy user of Spark at LinkedIn, and am excited about the ZStandard compression option recently incorporated into ORC 1.6. I would love to explore using it for storing and querying large (>10 TB) tables for my own disk-I/O-intensive workloads, and other users & companies may be interested in adopting ZStandard more broadly, since it seems to offer faster compression speeds at higher compression ratios with better multi-threaded support than zlib/Snappy. At scale, improvements of even ~10% on disk and/or compute, hopefully just from setting the “orc.compress” flag to a different value, could translate into palpable gains in capacity/cost cluster-wide without requiring broad engineering migrations. See a somewhat recent FB Engineering blog post on the topic for their reported experiences: https://engineering.fb.com/core-data/zstandard/
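For reference, the single-flag switch described above would look roughly like this in Spark SQL (table names here are hypothetical, and `zstd` requires an ORC version with Java ZStandard support on the classpath):

```sql
-- Sketch only: write an ORC table compressed with ZStandard.
-- The writer option maps to the ORC property orc.compress=ZSTD.
CREATE TABLE events_zstd
USING ORC
OPTIONS (compression 'zstd')
AS SELECT * FROM events;
```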

 

Do we know if ORC 1.6.x will make the cut for Spark 3.0?

 

A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to 1.5.8, but I don’t have a good understanding of how difficult incorporating ORC 1.6.x into Spark will be. For instance, in the PRs for enabling Java Zstd in ORC (https://github.com/apache/orc/pull/306 & https://github.com/apache/orc/pull/412), some additional work/discussion around Hadoop shims occurred to maintain compatibility across different versions of Hadoop (e.g. 2.7) and aircompressor (a library containing Java implementations of various compression codecs, so that dependence on Hadoop 2.9 is not required). Again, these may be non-issues, but I wanted to kindle discussion around whether this can make the cut for 3.0, since I imagine it’s a major upgrade many users will focus on migrating to once released.

 

Kind regards,

David Christle


Re: Spark 3.0 and ORC 1.6

Dongjoon Hyun-2
Hi, David.

Thank you for sharing your thoughts.
I'm also a supporter of ZStandard.

Apache Spark 3.0 already takes advantage of ZStd in several places.

   1) Switch the default codec for MapOutputStatus from GZip to ZStd.
   2) Add spark.eventLog.compression.codec to allow ZStd.
   3) Use Parquet+ZStd easily for data storage off the shelf.
       (with Hadoop 3.2 pre-built distribution)
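For reference, the three items above correspond roughly to these configuration keys (names as of Spark 3.0; please verify against the released documentation), e.g. in `spark-defaults.conf`:

```
# Sketch, assuming Spark 3.0 configuration key names:
spark.shuffle.mapStatus.compression.codec  zstd   # (1) MapOutputStatus codec
spark.eventLog.compress                    true
spark.eventLog.compression.codec           zstd   # (2) event log codec
spark.sql.parquet.compression.codec        zstd   # (3) Parquet data files
```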

So, the last big missing piece is ORC, right.

As a PMC member of both Apache Spark and Apache ORC,
I've been trying to reduce the gap between the two communities
in order to maximize the synergy.

Historically,

   1) At Apache Spark 2.3.0, Apache Spark started to depend on Apache ORC 1.4.0 (SPARK-21422)
   2) At Apache Spark 2.4.0, Apache ORC 1.5.2 became the default ORC library. (SPARK-23456)
   3) At Apache Spark 3.0.0, Apache Spark embraced the breaking change of Apache ORC 1.5.6 (SPARK-28208).
      And, Apache Spark 3.0.0-preview2 caught up to ORC 1.5.8, while Spark 2.4.5 and 2.4.6 will stay with 1.5.5.

However, Apache ORC 1.5.9 RC0 also had another breaking change. Although we minimized the regression at 1.5.9 RC1, we still need to adapt to the following changes. (Please see the ORC 1.5.9 RC0/RC1 vote emails for the details.)

   - Upgrade hive-storage-api from 2.6.0 to 2.7.1
   - Add new dependency `threeten-extra-1.5.0.jar`
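For builds that consume ORC directly, the two changes above would correspond to roughly these `pom.xml` entries (coordinates inferred from the artifact names; verify against the ORC 1.5.9 release notes):

```xml
<!-- Sketch: dependency deltas implied by the ORC 1.5.9 adaptation. -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-storage-api</artifactId>
  <version>2.7.1</version>  <!-- upgraded from 2.6.0 -->
</dependency>
<dependency>
  <groupId>org.threeten</groupId>
  <artifactId>threeten-extra</artifactId>
  <version>1.5.0</version>  <!-- new dependency -->
</dependency>
```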

The above breaking changes are a subset of those in ORC 1.6.x. And, we need to validate any potential performance regression at 1.6.x, too. I hope we can use Apache ORC 1.5.9 as a stepping stone to reach Apache ORC 1.6.x.

I'll create a PR to upgrade to Apache ORC 1.5.9 as soon as possible, but the Spark community will decide whether it makes it into 3.0.0.

In short, given the circumstances, Apache Spark 3.1.0 will be a safer target release for Apache ORC 1.6.x adoption. Spark 3.1.0 will arrive six months after Spark 3.0.0, per our release cadence. At that time, we can focus on ORC improvements with more references. (For now, Apache ORC 1.6 is not used by Apache Hive, either.)

Bests,
Dongjoon.


On Tue, Jan 28, 2020 at 12:41 PM David Christle <[hidden email]> wrote:
