[Spark 3.0] DataSourceV2 FileScan - Hive style partition pruning


Guy Khazma
Hi,

It seems that Hive-style partition pruning is not working for file-based
data sources such as Parquet and ORC when they are read through DataSourceV2.
This causes serious performance degradation for non-Hive tables.

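The reason is that the FileScan abstract class
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala>
is not aware of the partition and data filters.
The method that computes selectedPartitions calls the FileIndex listFiles
method with an empty sequence for both - see here
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L74>.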

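In the v1 data source path, the FileSourceScanExec class
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160>
receives the partition and data filters and uses them to skip unnecessary
partitions by passing them to the listFiles function - see here
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210>.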
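
To make the difference concrete, here is a minimal sketch in Scala. It does not use Spark's real classes: PartitionDir, PartitionFilter and ToyFileIndex are hypothetical stand-ins, the filter is a plain function rather than a Catalyst expression, and the data filters argument of the real listFiles is left out for brevity. It only shows why calling listFiles with an empty filter sequence ends up listing every partition.

```scala
// Simplified model only; none of these types are part of Spark's API.
object PruningSketch {
  // One Hive-style partition directory, e.g. date=2019-12-30/
  final case class PartitionDir(values: Map[String, String], files: Seq[String])

  // Hypothetical stand-in for a partition filter expression.
  type PartitionFilter = Map[String, String] => Boolean

  final class ToyFileIndex(partitions: Seq[PartitionDir]) {
    // Same shape of idea as listFiles: keep partitions that satisfy all filters.
    // With an empty filter sequence, forall is trivially true, so nothing is pruned.
    def listFiles(partitionFilters: Seq[PartitionFilter]): Seq[PartitionDir] =
      partitions.filter(p => partitionFilters.forall(f => f(p.values)))
  }

  val index = new ToyFileIndex(Seq(
    PartitionDir(Map("date" -> "2019-12-29"), Seq("date=2019-12-29/part-0.parquet")),
    PartitionDir(Map("date" -> "2019-12-30"), Seq("date=2019-12-30/part-0.parquet")),
    PartitionDir(Map("date" -> "2019-12-31"), Seq("date=2019-12-31/part-0.parquet"))
  ))

  // Models a query predicate such as: WHERE date = '2019-12-30'
  val onlyDec30: PartitionFilter = _.get("date").contains("2019-12-30")

  // Behaviour described for the v2 FileScan: no filters reach listFiles.
  def v2Style: Seq[String] = index.listFiles(Seq.empty).flatMap(_.files)

  // Behaviour described for the v1 FileSourceScanExec: filters are passed through.
  def v1Style: Seq[String] = index.listFiles(Seq(onlyDec30)).flatMap(_.files)
}
```

In this toy model v2Style returns the files of all three partitions, while v1Style returns only the file under date=2019-12-30.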

Are there any ongoing plans to add support for this?

Thanks,
Guy



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: [Spark 3.0] DataSourceV2 FileScan - Hive style partition pruning

Gengliang
Hi Guy,

Thanks for reporting the issue. I am working on it and there will be a PR this week.

Gengliang

On Mon, Dec 30, 2019 at 6:41 AM Guy Khazma <[hidden email]> wrote:
Hi,

It seems that hive style partition pruning is not working for file based
data sources such as Parquet and ORC.
This causes serious performance degradation for non hive tables.

The reason for that is that the FileScan
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala>
abstract class is not aware of the partition and data filters.
The method for getting the selectedPartitions calls the FileIndex listFiles
method with empty sequence for both - see here
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L74>.

In the v1 datasource the FileSourceScanExec
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160>
class gets the partition and data filters and uses them to filter unnecessary
partitions by passing them to the listFiles function - see here
<https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210>.

Are there any ongoing plans to add a support for that?

Thanks,
Guy



Re: [Spark 3.0] DataSourceV2 FileScan - Hive style partition pruning

Guy Khazma
Thanks Gengliang.

Please let me know if I can help.

Also, it seems to me that the dynamic partition pruning capabilities (SPARK-11150) that were added in Spark 3.0 are not available for DataSourceV2 file-based data sources either.
The code for dynamically selecting the partitions was added only to the v1 code path - see here.
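
For readers less familiar with the feature, the general idea of dynamic partition pruning can be sketched as follows. This is a toy model under stated assumptions, not Spark's implementation: FactPartition, DimRow, joinKeys and prunedFiles are hypothetical names, and a real plan would operate on query plans and subqueries rather than in-memory collections. It models a join between a partitioned fact table and a small, selectively filtered dimension table.

```scala
// Toy model only; none of these names exist in Spark.
object DynamicPruningSketch {
  // A partition of the large fact table, partitioned by day.
  final case class FactPartition(day: String, files: Seq[String])
  // A row of the small dimension table that the fact table is joined with.
  final case class DimRow(day: String, isHoliday: Boolean)

  val factPartitions: Seq[FactPartition] = Seq(
    FactPartition("2019-12-24", Seq("day=2019-12-24/part-0.parquet")),
    FactPartition("2019-12-25", Seq("day=2019-12-25/part-0.parquet")),
    FactPartition("2019-12-26", Seq("day=2019-12-26/part-0.parquet"))
  )

  val dim: Seq[DimRow] = Seq(
    DimRow("2019-12-24", isHoliday = false),
    DimRow("2019-12-25", isHoliday = true),
    DimRow("2019-12-26", isHoliday = false)
  )

  // Step 1 (at run time): apply the selective filter on the dimension side
  // and collect the join key values that survive it.
  def joinKeys(rows: Seq[DimRow]): Set[String] =
    rows.filter(_.isHoliday).map(_.day).toSet

  // Step 2: use those keys as a partition filter on the fact side,
  // so only matching partitions are listed and read.
  def prunedFiles(keys: Set[String]): Seq[String] =
    factPartitions.filter(p => keys.contains(p.day)).flatMap(_.files)
}
```

The partition filter here is not known when the query is planned; it only exists once the dimension side has been evaluated, which is why the scan needs a way to accept filters at run time.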

Are there plans to add support for v2 as well?

