Faster Spark on ORC with Apache ORC

Faster Spark on ORC with Apache ORC

Dong Joon Hyun
Hi, All.

Apache Spark has always been a fast and general engine, and
since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module, with a Hive dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and gain several benefits.

    - Speed: Use Spark's `ColumnarBatch` and ORC's `RowBatch` together, which means full vectorization support.

    - Stability: Apache ORC 1.4.0 already contains many fixes, and we can rely on the ORC community's efforts going forward.

    - Usability: Users can use the ORC data source without the Hive module (`-Phive`).

    - Maintainability: Reduce the Hive dependency and eventually remove some old legacy code from the `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into the `sql/core` module.
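The Speed point above can be made concrete with a toy, pure-Python illustration (this is not Spark or ORC code; the names and sizes are made up). Row-at-a-time readers pay per-row dispatch overhead, while a vectorized reader amortizes that overhead over a whole column batch:

```python
# Toy illustration of why batch-at-a-time ("vectorized") processing beats
# row-at-a-time readers. Spark's ColumnarBatch and ORC's RowBatch exchange
# whole column vectors, so per-row interpretation overhead is paid once per
# batch instead of once per row.

def sum_row_at_a_time(rows):
    total = 0
    for row in rows:          # one dispatch per row
        total += row[0]
    return total

def sum_vectorized(batches):
    total = 0
    for batch in batches:     # per-batch overhead amortized over many rows
        total += sum(batch)   # tight loop over a contiguous column vector
    return total

rows = [(i,) for i in range(10_000)]
batches = [list(range(i, i + 1_000)) for i in range(0, 10_000, 1_000)]
assert sum_row_at_a_time(rows) == sum_vectorized(batches) == 49_995_000
```

Both functions compute the same answer; the vectorized form simply moves the loop over rows into a tight inner loop over a column buffer, which is where the benchmark speedups below come from.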


Could you share your opinions on this approach?

Bests,
Dongjoon.

Re: Faster Spark on ORC with Apache ORC

Dong Joon Hyun
Hi,

I have been wondering how much more Apache Spark 2.2.0 will improve.

This is the prior record from the source code:


    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
    SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                    215 /  262         73.0          13.7       1.0X
    SQL Parquet MR                           1946 / 2083          8.1         123.7       0.1X


So, I got a similar (but slower) machine and ran `ParquetReadBenchmark` on it.

Apache Spark seems to have improved considerably again. Interestingly, the MR version improved even more.


    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                         102 /  123        153.7           6.5       1.0X
    SQL Parquet MR                                 409 /  436         38.5          26.0       0.3X



For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) gives the following results:


    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL ORC Vectorized                             147 /  153        107.3           9.3       1.0X
    SQL ORC MR                                     338 /  369         46.5          21.5       0.4X
    HIVE ORC MR                                    408 /  424         38.6          25.9       0.4X


Given that this is an initial PR without optimization, ORC vectorization seems to be catching up well.
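As a quick sanity check on the table above, the Relative column can be reproduced from the Best times (a throwaway snippet; the numbers are copied directly from the benchmark output):

```python
# Best-case times (ms) copied from the ORC benchmark output above.
best_ms = {
    "SQL ORC Vectorized": 147,
    "SQL ORC MR": 338,
    "HIVE ORC MR": 408,
}
base = best_ms["SQL ORC Vectorized"]
# The table's "Relative" column is base/ms, rounded to one decimal;
# ms/base gives the slowdown of each reader vs. the vectorized one.
for name, ms in best_ms.items():
    print(f"{name}: relative = {base / ms:.1f}X ({ms / base:.1f}x the vectorized time)")
```

So even this unoptimized first cut is roughly 2.3x faster than the new MR path and 2.8x faster than the Hive MR path, matching the 0.4X entries in the table.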


Bests,
Dongjoon.


From: Dongjoon Hyun <[hidden email]>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "[hidden email]" <[hidden email]>
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark always has been a fast and general engine, and
since SPARK-2883, Spark supports Apache ORC inside `sql/hive` module with Hive dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and get some benefits.

    - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together which means full vectorization support.

    - Stability: Apache ORC 1.4.0 already has many fixes and we can depend on ORC community effort in the future.

    - Usability: Users can use `ORC` data sources without hive module (-Phive)

    - Maintainability: Reduce the Hive dependency and eventually remove some old legacy code from `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into `sql/core` module.


Could you give some opinions on this approach?

Bests,
Dongjoon.

Re: Faster Spark on ORC with Apache ORC

Dong Joon Hyun
Hi, All.

As a continuation of SPARK-20682 (Support a new faster ORC data source based on Apache ORC), I would like to suggest making the default ORCFileFormat configurable between `sql/hive` and `sql/core` for the following:

    spark.read.orc(...)
    spark.write.orc(...)

    CREATE TABLE t
    USING ORC
    ...
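With such a switch, users could flip between implementations per session. A hedged sketch follows (the exact configuration key was still under review in this PR; `spark.sql.orc.impl`, with values `hive` and `native`, is the name the option eventually took in Spark 2.3, and `spark` is assumed to be an existing SparkSession; the path is hypothetical):

```python
# Sketch only: requires a running SparkSession. The conf key below is the
# one Spark eventually shipped ("spark.sql.orc.impl" in 2.3+); the PR's
# exact key was still under discussion at the time of this thread.
spark.conf.set("spark.sql.orc.impl", "native")   # new sql/core ORC reader
df = spark.read.orc("/tmp/people.orc")           # hypothetical path

spark.conf.set("spark.sql.orc.impl", "hive")     # legacy sql/hive reader
df_legacy = spark.read.orc("/tmp/people.orc")
```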

It's filed as SPARK-20728, and I made a PR for that, too.

In the new PR:

    - With that option, you can more easily test not only the PR itself but also your own applications.
    - To help reviewers, the PR includes updated benchmarks for both ORCReadBenchmark and ParquetReadBenchmark.

Since the previous PR is still in progress, the new PR inevitably includes some of its changes.
I'll remove the duplication later in any case.

Any opinions on Spark ORC improvement are welcome!

Thanks,
Dongjoon.




From: Dong Joon Hyun <[hidden email]>
Sent: Friday, May 12, 2017 10:49 AM
To: [hidden email]
Subject: Re: Faster Spark on ORC with Apache ORC
 

Re: Faster Spark on ORC with Apache ORC

Dong Joon Hyun
In reply to this post by Dong Joon Hyun

Hi, All.

Since the Apache Spark 2.2 vote passed successfully last week,
I think it's a good time for me to ask your opinions again about the following PR.

https://github.com/apache/spark/pull/17980  (+3,887, −86)

It covers the following issues:

  • SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
  • SPARK-20682: Support a new faster ORC data source based on Apache ORC

Basically, the approach is to officially adopt the latest Apache ORC 1.4.0.
You can switch between the legacy ORC data source and the new ORC data source.

Could you help me move this forward so it can improve Apache Spark 2.3?

Bests,

Dongjoon.


Re: Faster Spark on ORC with Apache ORC

Jeff Zhang

Awesome, Dong Joon. It's a great improvement. Looking forward to its merge.





Dong Joon Hyun <[hidden email]> wrote on Wednesday, July 12, 2017, at 6:53 AM:

