[DISCUSS] Supporting hive on DataSourceV2

[DISCUSS] Supporting hive on DataSourceV2

JackyLee
Hi devs,
I’d like to start a discussion about supporting Hive on DataSourceV2. We’re
now working on a project that uses DataSourceV2 to provide multi-source
support; it works very well with our data lake solution, but it does not
yet support HiveTable.

There are three reasons why we need to support Hive on DataSourceV2.
1. Hive itself is one of Spark's data sources.
2. HiveTable is essentially a FileTable with its own input and output
formats, so it fits the FileTable model well.
3. HiveTable should be stateless, so users can freely read or write Hive
in batch or microbatch mode.

We implemented stateless Hive on DataSourceV1; it lets users write into
Hive in streaming or batch mode, and it is widely used in our company.
Recently, we have been working on supporting Hive on DataSourceV2; multiple
Hive catalogs and DDL commands are already supported.
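For readers following along: multi-catalog support in Spark 3.x hangs off the `spark.sql.catalog.<name>` configuration convention, so a v2 Hive catalog like the one described here would presumably be wired up along these lines. The catalog name, class name, and option key below are illustrative only, not from any released connector:

```
# Hypothetical registration of a DataSourceV2 Hive catalog
spark.sql.catalog.hive_v2=org.example.hive.HiveCatalog
spark.sql.catalog.hive_v2.metastore.uri=thrift://metastore-host:9083
```

After such a registration, a query like `SELECT * FROM hive_v2.db.table` would resolve through the plugged-in catalog rather than Spark's built-in session catalog.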

Looking forward to more discussions on this.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: [DISCUSS] Supporting hive on DataSourceV2

Ryan Blue

Hi Jacky,

We’ve internally released support for Hive tables (and Spark FileFormat tables) using DataSourceV2 so that we can switch between catalogs; sounds like that’s what you are planning to build as well. It would be great to work with the broader community on a Hive connector.

I will get a branch of our connectors published so that you can take a look. I think it should be fairly close to what you’re talking about building, with a few exceptions:

  • Our implementation always uses our S3 committers, but it should be easy to change this
  • It supports per-partition formats, like Hive

Do you have an idea about where the connector should be developed? I don’t think it makes sense for it to be part of Spark. That would keep complexity in the main project and require updating Hive versions slowly. Using a separate project would mean less code in Spark specific to one source, and could more easily support multiple Hive versions. Maybe we should create a project for catalog plug-ins?
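The catalog plug-in mechanism Ryan alludes to can be sketched in plain Scala. This is not Spark's actual code: Spark resolves the class named by `spark.sql.catalog.<name>` by reflection, whereas this sketch uses a small registry so it runs standalone; the `DemoHiveCatalog` class is hypothetical. The point it illustrates is that a Hive connector shipped as a separate project only needs to expose a class implementing the plug-in interface.

```scala
// Minimal model of Spark 3.x catalog plug-ins: a catalog is named in the
// conf, instantiated by the resolver, and initialized with its own options.
trait CatalogPlugin {
  def initialize(name: String, options: Map[String, String]): Unit
  def name: String
}

// Hypothetical external connector; a real one would talk to the metastore.
class DemoHiveCatalog extends CatalogPlugin {
  private var catalogName: String = ""
  def initialize(name: String, options: Map[String, String]): Unit =
    catalogName = name
  def name: String = catalogName
}

object CatalogResolver {
  // Real Spark loads the conf value as a class name via reflection; a
  // registry keyed by that name keeps this sketch self-contained.
  private val registry: Map[String, () => CatalogPlugin] = Map(
    "org.example.hive.DemoHiveCatalog" -> (() => new DemoHiveCatalog)
  )

  def resolve(conf: Map[String, String], name: String): CatalogPlugin = {
    val className = conf(s"spark.sql.catalog.$name")
    val plugin = registry(className)()
    // Options for this catalog are the conf keys under its prefix.
    val prefix = s"spark.sql.catalog.$name."
    val options = conf.collect {
      case (k, v) if k.startsWith(prefix) => k.stripPrefix(prefix) -> v
    }
    plugin.initialize(name, options)
    plugin
  }
}
```

A separate "catalog plug-ins" project, as proposed, would own implementations like `DemoHiveCatalog` while Spark itself only ships the interface and the resolver.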

rb





--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Supporting hive on DataSourceV2

JackyLee
Glad to hear that you have already supported it; that is exactly what we
are doing. The exceptions you mentioned don't conflict with Hive support,
and we can easily make our work compatible with them.

> Do you have an idea about where the connector should be developed? I don’t
> think it makes sense for it to be part of Spark. That would keep complexity
> in the main project and require updating Hive versions slowly. Using a
> separate project would mean less code in Spark specific to one source, and
> could more easily support multiple Hive versions. Maybe we should create a
> project for catalog plug-ins?

AFAICT, it is necessary to create a new project, since users need to create
their own connectors according to their needs. In our implementation of Hive
on DataSourceV2, we put the basic partition API and commands in the main
project, and put a default HiveCatalog and HiveConnector into an external
project. Users can use our project as-is, or implement their own
HiveConnector. Maybe this is a good way to support it.
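The core/external split described above can be sketched as follows. The names (`HiveConnector`, `DefaultHiveConnector`) are illustrative, not from any released module, and the in-memory "table" stands in for a real metastore-backed scan and commit. The design point is that the main project ships only a stable SPI, so the default connector and any user-written replacement are interchangeable:

```scala
// --- core project: stable SPI only ---
trait HiveConnector {
  def read(table: String): Seq[String]            // stand-in for a real scan
  def write(table: String, rows: Seq[String]): Unit
}

// --- external project: default implementation, replaceable by users ---
class DefaultHiveConnector extends HiveConnector {
  private val store = scala.collection.mutable.Map.empty[String, Seq[String]]
  def read(table: String): Seq[String] = store.getOrElse(table, Nil)
  def write(table: String, rows: Seq[String]): Unit =
    store(table) = store.getOrElse(table, Nil) ++ rows
}

// --- user code: depends only on the SPI; may supply a custom connector ---
def loadConnector(custom: Option[HiveConnector]): HiveConnector =
  custom.getOrElse(new DefaultHiveConnector)
```

Because callers program against `HiveConnector` alone, swapping the default for a custom implementation requires no change in the main project.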

Looking forward to your patch; we can cooperate in this area.





Re: [DISCUSS] Supporting hive on DataSourceV2

JackyLee
Hi Blue,

I have created a JIRA for supporting Hive on DataSourceV2; we can associate
specific modules with it:
https://issues.apache.org/jira/browse/SPARK-31241

Could you share a Google doc of the current design, so that we can discuss
and improve it in detail here?


