
Spark reading parquet files behaved differently with number of paths


Yash Sharma
Hi Fellow Devs,
I have noticed that the Spark parquet reader behaves very differently over the same data set in two scenarios:
1. passing a single parent path to the data, vs
2. passing all the files individually to parquet(paths: String*)

The data set has about ~50K files. With the first option, the job copes with all the data and completes in a few hours. However, for a use case where only a subset of the paths has to be passed, the job is stuck for a few hours and then dies. It never starts executing anything, and it seems to be doing some sort of sequential 'file path exists' check before starting the job.
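To illustrate why a per-path check can dominate startup time, here is a scaled-down, local-filesystem sketch (not Spark internals; the file count and names are made up). Locally both operations are fast, but the point is the access pattern: one listing of the parent versus one check per passed path. Against S3, each existence check is a separate HTTP round trip, so 50K sequential checks can take hours.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as root:
    n = 5000  # scaled-down stand-in for ~50K parquet part files
    paths = [os.path.join(root, f"part-{i:05d}.parquet") for i in range(n)]
    for p in paths:
        open(p, "w").close()

    # Analogous to passing the single parent path: one listing call.
    t0 = time.perf_counter()
    listed = os.listdir(root)
    t_list = time.perf_counter() - t0

    # Analogous to passing every path: one existence check per path,
    # done sequentially before any work starts.
    t0 = time.perf_counter()
    checks = sum(os.path.exists(p) for p in paths)
    t_seq = time.perf_counter() - t0

    print(f"listing: {len(listed)} files in {t_list:.4f}s")
    print(f"per-path checks: {checks} files in {t_seq:.4f}s")
```

On a local disk the gap is small; multiply the per-check cost by an S3 round-trip latency (tens of milliseconds each) and the sequential variant balloons.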

Has anyone stumbled upon this issue?

Appreciate any pointers.

Snippet:
events = spark.read \
    .schema(file_schema) \
    .option("basePath", 's3://path/to/data/') \
    .parquet(*list_of_paths)