We use partitioned + bucketed datasets for use cases where we can afford to take a perf hit at write time so that reads are optimised. But I feel Spark could exploit this data layout more effectively in query planning. Here I describe why this is a problem and how it could be improved.
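For concreteness, here is a minimal sketch of the layout I mean, written against the public DataFrameWriter API. The table name (events_bucketed), the columns (date, user_id), the input path, and the bucket count are all hypothetical:

    // Hypothetical example: persist a dataset that is both partitioned
    // (by date) and bucketed (by user_id) into the metastore.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketed-layout-sketch")
      .enableHiveSupport()
      .getOrCreate()

    spark.read.parquet("s3://my-bucket/events")  // hypothetical input path
      .write
      .partitionBy("date")           // one directory per date value
      .bucketBy(256, "user_id")      // 256 bucket files per partition, hashed on user_id
      .sortBy("user_id")             // keep each bucket file sorted on the bucket key
      .format("parquet")
      .saveAsTable("events_bucketed")  // layout is recorded in the metastore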
There is a class of common problems that can't be solved by:
In Spark, when you both partition and bucket a dataset, the format on disk and in the metastore is correctly recorded. But this information isn't used to full effect in query planning (see the illustration at the end of this mail) because:
This will involve changing some aspects of query planning. Here I list the top-level changes:
There is a strong requirement for this functionality on my team (at Amazon). I've opened a JIRA regarding this issue here. I did consider DataSourceV2, but the native datasource path looks like a better fit here.
I'd like some input on this. Is this approach feasible, and is it aligned with how Spark wants to handle native datasources in the future? Does anyone else have similar requirements?
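To make the read side concrete, here is a hedged illustration of the kind of query I would expect to benefit, reusing the hypothetical events_bucketed table and spark session from the sketch above. Ideally a point lookup on the bucket column would touch only the single bucket file per selected date partition that can contain the key, rather than all 256; the plan shows what the planner actually does:

    // Hypothetical illustration: a point lookup on the bucket column,
    // restricted to a single date partition.
    import org.apache.spark.sql.functions.col

    val hits = spark.table("events_bucketed")
      .where(col("date") === "2019-01-01" && col("user_id") === 42)

    // Inspect the physical plan to see whether the scan was pruned to
    // one bucket file and whether downstream exchanges were avoided.
    hits.explain(true)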