Spark SQL reads all leaf directories on a partitioned Hive table



invkrh
Hi,
I am using Spark SQL 2.3.3 to read a Hive table that is partitioned by day, hour, platform, request_status, and is_sampled. The underlying data is stored as Parquet on HDFS.
Here is the SQL query that reads just one partition:

```
spark.sql("""
SELECT rtb_platform_id, SUM(e_cpm)
FROM raw_logs.fact_request
WHERE day = '2019-08-01'
AND hour = '00'
AND platform = 'US'
AND request_status = '3'
AND is_sampled = 1
GROUP BY rtb_platform_id
""").show
```

However, from the Spark web UI, the stage description shows:

```
Listing leaf files and directories for 201616 paths:
viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0, ...
```

It seems the job is listing all of the partitions of the table, so it takes far too long for a query that touches only one partition. One workaround is to use the `spark.read.parquet` API to read the Parquet files directly, since Spark can discover partitions from the directory layout on its own.
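For reference, the workaround looks roughly like this. This is only a sketch: it assumes the table's root directory is the `fact_request` path shown in the listing above, and it rewrites the same aggregation from the SQL query with DataFrame operations.

```scala
import org.apache.spark.sql.functions.{col, sum}

// Read the Parquet files directly, bypassing the Hive metastore.
// Spark infers the partition columns (day, hour, platform, ...) from the
// directory names, so the filters below prune directories up front
// instead of listing every leaf path.
val df = spark.read
  .parquet("viewfs://root/user/bilogs/logs/fact_request")
  .where(col("day") === "2019-08-01" &&
         col("hour") === "00" &&
         col("platform") === "US" &&
         col("request_status") === "3" &&
         col("is_sampled") === 1)
  .groupBy("rtb_platform_id")
  .agg(sum("e_cpm"))

df.show()
```

With this approach the partition filters are applied during partition discovery rather than after a full listing, which is why it avoids the 200k-path scan.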

Still, I would like to know: is there a way to get the same partition-aware behavior through the Hive metastore when using the `spark.sql` API?

Any help is highly appreciated!

Thank you.

--
Hao Ren