Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

Dongjoon Hyun-2
Hi, All.

The vectorized ORC reader is now supported in Apache Spark 2.3.

    https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster without any feature penalty.

Thank you for all your support, especially Wenchen Fan.

It was done in two commits.

    [SPARK-16060][SQL] Support Vectorized ORC Reader
    https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766

    [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc reader
    https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC` to `Native ORC Vectorized`.
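
For anyone who wants to try the `Native ORC Vectorized` path themselves, here is a minimal PySpark sketch. It assumes the two SQL settings as I understand them in Spark 2.3 (`spark.sql.orc.impl` and `spark.sql.orc.enableVectorizedReader`), so please verify the names and defaults against your build. The helper name `read_orc_vectorized` and the app name are made up for illustration.

```python
# Settings assumed from Spark 2.3's SQL configuration; verify against your version.
NATIVE_ORC_CONF = {
    # Select the native ORC data source (built on Apache ORC 1.4.1)
    # instead of the Hive built-in one.
    "spark.sql.orc.impl": "native",
    # Read ORC data in column batches; applies to the native implementation.
    "spark.sql.orc.enableVectorizedReader": "true",
}


def read_orc_vectorized(path, conf=NATIVE_ORC_CONF):
    """Hypothetical helper: read an ORC path with the settings above."""
    # Imported lazily so the settings dict can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("orc-vectorized-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    return spark.read.orc(path)
```

Calling `read_orc_vectorized("/tmp/example.orc")` (a placeholder path) returns a DataFrame. Setting `spark.sql.orc.impl` to `hive` instead should give the `Hive built-in ORC` path for comparison.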

Thank you.

Best,
Dongjoon.