Catalog API for Partition


JackyLee
Hi devs,

In order to support partition commands for DataSourceV2 and Lakehouse, I'm
trying to add a Partition API for multiple catalogs.

These APIs are widely used in MySQL, Hive, and other data sources, and we can
use them to manage partition metadata in Lakehouse.

JIRA: https://issues.apache.org/jira/browse/SPARK-31694
PR: https://github.com/apache/spark/pull/28617

We have already used these APIs to support Lakehouse on Delta Lake and Hive
on DataSourceV2, and they do solve partition support on DataSourceV2.
Could anyone review the PR?

Thanks,
Jacky Lee



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Catalog API for Partition

cloud0fan
In Hive, a partition does two things:
1. It acts as an index to speed up data scans.
2. It acts as a way to manage the data: people can add/drop partitions.

How do you unify these two things in your API design?

(In reply to JackyLee's message of Fri, Jul 17, 2020 at 12:03 AM.)


Re: Catalog API for Partition

JackyLee
Hi, Wenchen. Thanks for your attention and reply.

Firstly, these Partition Catalog APIs are not specific to Hive; they can be
used with Lakehouse, MySQL, or any other source that supports partitions.
Secondly, these Partition Catalog APIs are designed only for better data
management, not for speeding up data scans. The APIs used to speed up Hive
data scans are different from these APIs.
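
A minimal sketch of what a management-only partition API could look like, assuming a partition is identified by a map of column names to values. The interface name `PartitionManagementSketch`, its method names, and the in-memory implementation are hypothetical and are not taken from the PR; the point is only that these calls touch partition metadata and never read table data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: names are illustrative, not the API proposed in the PR.
interface PartitionManagementSketch {
    void createPartition(Map<String, String> spec, Map<String, String> properties);
    boolean dropPartition(Map<String, String> spec);
    List<Map<String, String>> listPartitions();
}

// In-memory implementation that only tracks metadata; it never scans data files.
class InMemoryPartitionCatalog implements PartitionManagementSketch {
    // Partition spec (column -> value) mapped to that partition's properties.
    private final Map<Map<String, String>, Map<String, String>> partitions =
        new LinkedHashMap<>();

    @Override
    public void createPartition(Map<String, String> spec, Map<String, String> properties) {
        if (partitions.containsKey(spec)) {
            throw new IllegalStateException("Partition already exists: " + spec);
        }
        // Defensive copies so later changes by the caller do not leak in.
        partitions.put(Map.copyOf(spec), Map.copyOf(properties));
    }

    @Override
    public boolean dropPartition(Map<String, String> spec) {
        // Returns true only if the partition existed and was removed.
        return partitions.remove(spec) != null;
    }

    @Override
    public List<Map<String, String>> listPartitions() {
        return new ArrayList<>(partitions.keySet());
    }

    public static void main(String[] args) {
        InMemoryPartitionCatalog catalog = new InMemoryPartitionCatalog();
        catalog.createPartition(Map.of("dt", "2020-07-17"), Map.of("location", "/tmp/t/dt=2020-07-17"));
        System.out.println("Partitions after create: " + catalog.listPartitions());
        catalog.dropPartition(Map.of("dt", "2020-07-17"));
        System.out.println("Partitions after drop: " + catalog.listPartitions());
    }
}
```

These operations correspond roughly to commands such as `ALTER TABLE ... ADD PARTITION` and `ALTER TABLE ... DROP PARTITION`; how a real catalog would persist the metadata is left open here.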

Currently, we use the Hive Catalog APIs to speed up Hive data scans and to
write data into Hive. However, we are trying to redefine HiveTable so that it
implements FileTable and uses PartitioningPruning to speed up Hive scans.
Personally, I think this is a better way to support Hive in DataSourceV2.
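
For contrast with the management APIs, here is a rough sketch of the general idea behind partition pruning: the scan keeps only the partitions whose values satisfy the query filter. The class `PartitionPruningSketch` and its `prune` helper are hypothetical and do not reflect how Spark or the PR implements pruning.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PartitionPruningSketch {
    // Keep only the partitions whose values satisfy the filter; a scan would
    // then read files for the surviving partitions only.
    static List<Map<String, String>> prune(List<Map<String, String>> partitions,
                                           Predicate<Map<String, String>> filter) {
        return partitions.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> all = List.of(
            Map.of("dt", "2020-07-16"),
            Map.of("dt", "2020-07-17"),
            Map.of("dt", "2020-07-18"));
        // Plays the role of a query filter such as: WHERE dt >= '2020-07-17'
        List<Map<String, String>> kept =
            prune(all, p -> p.get("dt").compareTo("2020-07-17") >= 0);
        System.out.println(kept.size() + " of " + all.size() + " partitions would be scanned");
    }
}
```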

Thanks again.
Jacky Lee




Re: Catalog API for Partition

cloud0fan
Yeah, we don't want the partitions to be Hive-specific. My point is that we call it "Partition Catalog APIs", which makes me confused about the relationship between these and the "partitions" in `TableCatalog.createTable`. Are the two orthogonal, or do you unify them in some way?

(In reply to JackyLee's message of Sat, Jul 18, 2020 at 12:02 AM.)


Re: Catalog API for Partition

JackyLee
The `partitioning` in `TableCatalog.createTable` is the partition schema for
the table; it doesn't contain the partition metadata for an actual
partition. Besides, the metadata of an actual partition may cover many
partition schemas, as with a Hive partition.
Thus I created a `TablePartition` to hold the partition metadata, which
can be distinguished from `Transform`, and created the `Partition Catalog APIs`
to manage partition metadata.

In short, a `TablePartition` contains multiple `Transform`s and the partition
metadata for an actual partition in Hive or another data source.
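
To illustrate the distinction, here is a small sketch that separates the table-level partition schema from a concrete partition. Both class names and all field names are assumptions made for illustration: the thread does not list the fields of `TablePartition`, and plain column names stand in for Spark's `Transform` here.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Table-level partitioning: which columns partition the table. This stands in
// for the `partitioning` argument of TableCatalog.createTable (simplified to
// column names; real code would use Transform).
class PartitionSchemaSketch {
    final List<String> columns;

    PartitionSchemaSketch(List<String> columns) {
        this.columns = List.copyOf(columns);
    }
}

// One concrete partition: a value per partition column plus its metadata.
// Hypothetical shape, not the TablePartition class from the PR.
class TablePartitionSketch {
    final Map<String, String> spec;      // e.g. year=2020, month=07
    final Map<String, String> metadata;  // e.g. location

    TablePartitionSketch(PartitionSchemaSketch schema,
                         Map<String, String> spec,
                         Map<String, String> metadata) {
        // A concrete partition must supply exactly the schema's columns.
        if (!spec.keySet().equals(new HashSet<>(schema.columns))) {
            throw new IllegalArgumentException(
                "Spec " + spec + " does not match partition columns " + schema.columns);
        }
        this.spec = Collections.unmodifiableMap(new LinkedHashMap<>(spec));
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    public static void main(String[] args) {
        PartitionSchemaSketch schema = new PartitionSchemaSketch(List.of("year", "month"));
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("year", "2020");
        spec.put("month", "07");
        TablePartitionSketch partition = new TablePartitionSketch(
            schema, spec, Map.of("location", "/warehouse/t/year=2020/month=07"));
        System.out.println(partition.spec + " -> " + partition.metadata);
    }
}
```

The schema is declared once when the table is created, while many `TablePartitionSketch` instances can exist for the same table, one per actual partition.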


