Spark SQL reads all leaf directories on a partitioned Hive table
I am using Spark SQL 2.3.3 to read a Hive table that is partitioned by day, hour, platform, request_status, and is_sampled. The underlying data is stored as Parquet on HDFS.
Here is the SQL query, which reads just one partition:
```
spark.sql("""
  SELECT rtb_platform_id, SUM(e_cpm)
  FROM raw_logs.fact_request
  WHERE day = '2019-08-01' AND hour = '00' AND platform = 'US'
    AND request_status = '3' AND is_sampled = 1
  GROUP BY rtb_platform_id
""").show
```
However, from the Spark web UI, the stage description shows:
```
Listing leaf files and directories for 201616 paths:
```
It seems the job is listing every partition of the table, and that takes far too long for a query that touches only one partition. One workaround is to read the Parquet files directly with the `spark.read.parquet` API, since Spark's partition discovery understands partitioned directory layouts (see the sketch below).
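Here is a minimal sketch of that workaround (the HDFS location is hypothetical; substitute the table's actual warehouse path):

```
import org.apache.spark.sql.functions.sum

// Hypothetical table location; point the reader at the single partition
// directory so only that path has to be listed. The basePath option keeps
// the partition columns in the resulting schema.
val tablePath = "hdfs:///warehouse/raw_logs.db/fact_request"
val onePartition = spark.read
  .option("basePath", tablePath)
  .parquet(s"$tablePath/day=2019-08-01/hour=00/platform=US/request_status=3/is_sampled=1")

onePartition
  .groupBy("rtb_platform_id")
  .agg(sum("e_cpm"))
  .show
```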
Still, I would like to know: is there a way to leverage Hive's partition metadata for pruning while using the `spark.sql` API?
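For reference, these are the configuration flags that look relevant, shown with what I understand to be the Spark 2.3.3 defaults (I may be misreading their scope):

```
// Pruning-related settings; the values are, to my understanding, the defaults.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")   // push partition predicates down to the metastore
spark.conf.set("spark.sql.hive.manageFilesourcePartitions", "true")  // keep file-source partition metadata in the metastore
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")     // use Spark's native Parquet reader for Hive tables
```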