
Faster Spark on ORC with Apache ORC


Faster Spark on ORC with Apache ORC

Dong Joon Hyun
Hi, All.

Apache Spark has always been a fast and general engine, and
since SPARK-2883 it has supported Apache ORC inside the `sql/hive` module, with a Hive dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and gain several benefits.

    - Speed: Use Spark's `ColumnarBatch` and ORC's `RowBatch` together, which means full vectorization support (see the ORC-side sketch after this list).

    - Stability: Apache ORC 1.4.0 already contains many fixes, and we can rely on the ORC community's effort going forward.

    - Usability: Users can use the ORC data source without the Hive module (`-Phive`); a usage sketch follows below.

    - Maintainability: Reduce the Hive dependency and eventually remove old legacy code from the `sql/hive` module.
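
For the Speed point, here is a minimal sketch of ORC's vectorized read path using the Apache ORC core API; the file path and the consuming loop are illustrative, not code from the PR:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.orc.OrcFile

    // Open an ORC file with the Apache ORC core reader (no Hive involved).
    val reader = OrcFile.createReader(
      new Path("/tmp/single_int.orc"), OrcFile.readerOptions(new Configuration()))
    val rows  = reader.rows()
    // ORC's RowBatch: a columnar batch of (by default) 1024 rows.
    val batch = reader.getSchema.createRowBatch()
    while (rows.nextBatch(batch)) {
      // batch.cols holds one ColumnVector per column for batch.size rows.
      // The proposed data source would hand these vectors to Spark's
      // ColumnarBatch instead of materializing one Row per record.
      var i = 0
      while (i < batch.size) { /* consume batch.cols(0) at index i */ i += 1 }
    }
    rows.close()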

As a first step, I made a PR adding a new ORC data source to the `sql/core` module.
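
At the user level the change is invisible; a sketch of what would work without `-Phive` once the PR is in (paths are illustrative):

    // Plain ORC read/write through the data source API, no Hive module needed.
    val df = spark.read.orc("/tmp/people.orc")   // or spark.read.format("orc").load(...)
    df.write.mode("overwrite").orc("/tmp/people_copy.orc")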


Could you share your opinions on this approach?

Bests,
Dongjoon.

Re: Faster Spark on ORC with Apache ORC

Dong Joon Hyun
Hi,

I have been wondering how much more Apache Spark 2.2.0 has improved.

This is the prior record from the benchmark's source code.


    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
    SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                    215 /  262         73.0          13.7       1.0X
    SQL Parquet MR                           1946 / 2083          8.1         123.7       0.1X


So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it.

Apache Spark seems to have improved a lot again. Strangely, though, the MR version has improved even more overall.


    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                         102 /  123        153.7           6.5       1.0X
    SQL Parquet MR                                 409 /  436         38.5          26.0       0.3X
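
For anyone reproducing the numbers, the measured scan is roughly the following (a simplified sketch; the actual code lives in ParquetReadBenchmark and also covers the non-vectorized path):

    // Write a single-int-column Parquet table, then scan it with sum();
    // the row count and paths here are illustrative.
    spark.range(1024 * 1024 * 15).selectExpr("cast(id as int) as id")
      .write.mode("overwrite").parquet("/tmp/int_scan")
    spark.read.parquet("/tmp/int_scan").createOrReplaceTempView("t")
    spark.sql("SELECT sum(id) FROM t").collect()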



For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) produces the following.


    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL ORC Vectorized                             147 /  153        107.3           9.3       1.0X
    SQL ORC MR                                     338 /  369         46.5          21.5       0.4X
    HIVE ORC MR                                    408 /  424         38.6          25.9       0.4X


Given that this is an initial PR without optimizations, the vectorized ORC reader seems to be catching up quickly.


Bests,
Dongjoon.



Re: Faster Spark on ORC with Apache ORC

Dong Joon Hyun
Hi, All.

As a continuation of SPARK-20682 (Support a new, faster ORC data source based on Apache ORC), I would like to suggest making the default `ORCFileFormat` configurable between `sql/hive` and `sql/core` for the following:

    spark.read.orc(...)
    spark.write.orc(...)

    CREATE TABLE t
    USING ORC
    ...

It's filed as SPARK-20728, and I made a PR for that, too.
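
Usage-wise, the switch would look something like this; the option name below is illustrative, and the actual name and values are defined in the PR:

    // Hypothetical switch between the two implementations.
    spark.conf.set("spark.sql.orc.impl", "native")  // new sql/core data source
    val df = spark.read.orc("/tmp/t.orc")
    spark.conf.set("spark.sql.orc.impl", "hive")    // existing sql/hive path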

In the new PR,

    - With that option, you can more easily test not only the PR but also your own applications.
    - To help reviews, the PR includes updated benchmarks for both ORCReadBenchmark and ParquetReadBenchmark.

Since the previous PR is still in progress, the new PR inevitably contains some of its changes.
I'll remove the duplication later in any case.

Any opinions on the Spark ORC improvements are welcome!

Thanks,
Dongjoon.



