Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.


Dongjoon Hyun-2
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

    https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster without a feature penalty.

Thank you for all your support, especially Wenchen Fan.

It was done in two commits.

    [SPARK-16060][SQL] Support Vectorized ORC Reader
    https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766

    [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc reader
    https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC` to `Native ORC Vectorized`.

    https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

Thank you.

Bests,
Dongjoon.

Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

Dongjoon Hyun-2
Hi, Nicolas.

Yes. Apache Spark 2.3 includes new sub-improvements under SPARK-20901 (Feature parity for ORC with Parquet).
For your questions, the following three configurations are relevant.

1. spark.sql.orc.impl="native"
    A new `native` ORC implementation (based on the latest ORC 1.4.1) has been added and is the default.
    The old one is the `hive` implementation.

2. spark.sql.orc.enableVectorizedReader="true"
    By default, the `native` ORC implementation uses the vectorized reader code path where possible.
    Please note that vectorization (for both Parquet and ORC) in Apache Spark is supported only for simple data types.

3. spark.sql.hive.convertMetastoreOrc=true
    As with Parquet, Hive ORC tables are by default converted into file-based data sources so that the vectorized reader can be used.
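
Taken together, the three settings above can be made explicit when building a session. A minimal sketch for Spark 2.3 follows; the values shown are the 2.3 defaults, so setting them is only necessary if they have been overridden elsewhere, and the app name, path, and table name are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: these are the Spark 2.3 default values, set
// explicitly here for illustration.
val spark = SparkSession.builder()
  .appName("orc-vectorized-demo")                          // hypothetical app name
  .enableHiveSupport()
  .config("spark.sql.orc.impl", "native")                  // ORC 1.4.x-based reader, not the old `hive` one
  .config("spark.sql.orc.enableVectorizedReader", "true")  // vectorized path (simple data types only)
  .config("spark.sql.hive.convertMetastoreOrc", "true")    // convert Hive ORC tables to the file-based source
  .getOrCreate()

// With these defaults, both access patterns from the question
// can go through the vectorized reader (for supported schemas):
val df1 = spark.read.format("orc").load("/path/to/orc")     // hypothetical path
val df2 = spark.sql("select * from my_orc_table_in_hive")   // hypothetical table
```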

Bests,
Dongjoon.



On Sun, Jan 28, 2018 at 4:15 AM, Nicolas Paris <[hidden email]> wrote:
Hi

Thanks for this work.

Will this affect both:
1) spark.read.format("orc").load("...")
2) spark.sql("select ... from my_orc_table_in_hive")
?
