Performance regression for partitioned parquet data


Bertrand Bossy
Hi,

Since moving from Spark 2.0 to 2.1, we have been seeing a performance regression when reading a large, partitioned Parquet dataset:

We observe many (hundreds of) very short jobs executing before the job that actually reads the data starts. I looked into this and pinned it down to PartitioningAwareFileIndex: while recursively listing the directories, if a directory contains more than "spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32) paths, its children are listed using a Spark job. Because the tree is traversed serially, this can result in a lot of small Spark jobs executed one after the other, and the scheduling overhead dominates. Performance can be improved by tuning "spark.sql.sources.parallelPartitionDiscovery.threshold", but this is not a satisfactory solution.
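
For reference, the tuning workaround is a one-liner in the shell (a sketch; the value 1024 is illustrative and would have to exceed the fan-out of the widest directory levels to avoid the extra jobs entirely):

  // Raising the threshold keeps more of the listing on the driver
  // instead of launching a tiny Spark job per large directory.
  spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "1024")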

I think the current behaviour could be improved by walking the directory tree in breadth-first order and launching a single Spark job to list a level in parallel only when the number of paths at that level exceeds spark.sql.sources.parallelPartitionDiscovery.threshold. A rough sketch follows.
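
In simplified form, something like this (illustrative only, not Spark's actual code; listInParallel stands in for the existing helper that lists many paths with a single Spark job, and Path/FileStatus are the usual Hadoop types):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileStatus, Path}

  // Breadth-first listing: each level is either listed on the driver or,
  // if it is wide enough, handed to a single Spark job as a whole.
  def bfsListLeafFiles(
      roots: Seq[Path],
      hadoopConf: Configuration,
      threshold: Int,
      listInParallel: Seq[Path] => Seq[FileStatus]): Seq[FileStatus] = {
    val leaves = Seq.newBuilder[FileStatus]
    var level: Seq[Path] = roots
    while (level.nonEmpty) {
      val children =
        if (level.size > threshold) {
          listInParallel(level)  // one job for the whole level
        } else {
          level.flatMap(p => p.getFileSystem(hadoopConf).listStatus(p).toSeq)
        }
      val (dirs, files) = children.partition(_.isDirectory)
      leaves ++= files
      level = dirs.map(_.getPath)
    }
    leaves.result()
  }

This way the number of listing jobs is bounded by the depth of the tree rather than by the number of large directories.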

Does this approach make sense? The most closely related ticket I have found is "Regression in file listing performance" ( https://issues.apache.org/jira/browse/SPARK-18679 ).

Unless there is a reason for the current behaviour, I will create a ticket for this soon. I might have some time in the coming days to work on it.

Regards,
Bertrand

--

Bertrand Bossy | TERALYTICS

software engineer

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
www.teralytics.net





Re: Performance regression for partitioned parquet data

Michael Allman
Hi Bertrand,

I encourage you to create a ticket for this and submit a PR if you have time. Please add me as a listener, and I'll try to contribute/review.

Michael


Re: Performance regression for partitioned parquet data

Meihua Wu
I might have a similar problem:

in the spark-shell:
val data = spark.read.parquet("...")

after hitting enter, it takes more than 30 seconds for the read to complete and return control to the prompt. I am running Spark 2.1.1, but I have also tested it on 2.0.2 and encountered the same issue.
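
To isolate the listing overhead, a rough timing in the shell should do (a sketch; read.parquet resolves the relation eagerly, so most of the elapsed time is file listing and schema inference):

  val t0 = System.nanoTime()
  val data = spark.read.parquet("...")  // same elided path as above
  println(s"read.parquet took ${(System.nanoTime() - t0) / 1e9} s")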

thanks,

Mike




Re: Performance regression for partitioned parquet data

Bertrand Bossy
Hi,


I'll try to address cloud-fan's comment ASAP.

Any input welcome.

Regards,
Bertrand



