DataSourceV2: support for named tables

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

DataSourceV2: support for named tables

Ryan Blue

There are two main ways to load tables in Spark: by name (db.table) and by a path. Unfortunately, the integration for DataSourceV2 has no support for identifying tables by name.

I propose supporting the use of TableIdentifier, which is the standard way to pass around table names.

The reason I think we should do this is to easily support more ways of working with DataSourceV2 tables. SQL statements and parts of the DataFrameReader and DataFrameWriter APIs that use table names create UnresolvedRelation instances that wrap an unresolved TableIdentifier.

By adding support for passing TableIdentifier to a DataSourceV2Relation, then about all we need to enable these code paths is to add a resolution rule. For that rule, we could easily identify a default data source that handles named tables.

This is what we’re doing in our Spark build, and we have DataSourceV2 tables working great through SQL. (Part of this depends on the logical plan changes from my previous email to ensure inserts are properly resolved.)

In the long term, I think we should update how we parse tables so that TableIdentifier can contain a source in addition to a database/context and a table name. That would allow us to integration new sources fairly seamlessly, without needing to a rather redundant SQL create statement like this:

CREATE TABLE database.name USING source OPTIONS (table 'database.name')

Also, I think we should pass TableIdentifier to DataSourceV2Relation, rather than going with Wenchen’s suggestion that we pass the table name as a string property, “table”. My rationale is that the new API shouldn’t leak its internal details to other parts of the planner.

If we were to convert TableIdentifer to a “table” property wherever DataSourceV2Relation is created, we create several places that need to be in sync with the same convention. On the other hand, passing TableIdentifier to DataSourceV2Relation and relying on the relation to correctly set the options passed to readers and writers minimizes the number of places that conversion needs to happen.

rb

--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: DataSourceV2: support for named tables

Michael Armbrust
I am definitely in favor of first-class / consistent support for tables and data sources.

One thing that is not clear to me from this proposal is exactly what the interfaces are between:
 - Spark
 - A (The?) metastore
 - A data source

If we pass in the table identifier is the data source then responsible for talking directly to the metastore? Is that what we want? (I'm not sure)

On Fri, Feb 2, 2018 at 10:39 AM, Ryan Blue <[hidden email]> wrote:

There are two main ways to load tables in Spark: by name (db.table) and by a path. Unfortunately, the integration for DataSourceV2 has no support for identifying tables by name.

I propose supporting the use of TableIdentifier, which is the standard way to pass around table names.

The reason I think we should do this is to easily support more ways of working with DataSourceV2 tables. SQL statements and parts of the DataFrameReader and DataFrameWriter APIs that use table names create UnresolvedRelation instances that wrap an unresolved TableIdentifier.

By adding support for passing TableIdentifier to a DataSourceV2Relation, then about all we need to enable these code paths is to add a resolution rule. For that rule, we could easily identify a default data source that handles named tables.

This is what we’re doing in our Spark build, and we have DataSourceV2 tables working great through SQL. (Part of this depends on the logical plan changes from my previous email to ensure inserts are properly resolved.)

In the long term, I think we should update how we parse tables so that TableIdentifier can contain a source in addition to a database/context and a table name. That would allow us to integration new sources fairly seamlessly, without needing to a rather redundant SQL create statement like this:

CREATE TABLE database.name USING source OPTIONS (table 'database.name')

Also, I think we should pass TableIdentifier to DataSourceV2Relation, rather than going with Wenchen’s suggestion that we pass the table name as a string property, “table”. My rationale is that the new API shouldn’t leak its internal details to other parts of the planner.

If we were to convert TableIdentifer to a “table” property wherever DataSourceV2Relation is created, we create several places that need to be in sync with the same convention. On the other hand, passing TableIdentifier to DataSourceV2Relation and relying on the relation to correctly set the options passed to readers and writers minimizes the number of places that conversion needs to happen.

rb

--
Ryan Blue
Software Engineer
Netflix

Reply | Threaded
Open this post in threaded view
|

Re: DataSourceV2: support for named tables

Ryan Blue

I don’t have a good answer for that yet. My initial motivation here is mainly to get consensus around this:

  • DSv2 should support table names through SQL and the API, and
  • It should use the existing classes in the logical plan (i.e., TableIdentifier)

To contrast, I think Wenchen is arguing that the new API should pass string options to data sources and let the implementations decide what to do. (I hope I’ve characterized that correctly.)

We’re storing references to our Iceberg tables in the metastore, so I added a simple rule to resolve TableIdentifier to DataSourceV2Relation(IcebergSource). Everything worked fine for us after that, but it isn’t a solution that will work in Spark upstream.

I think the right long-term solution is to mirror what Presto does. It adds an optional catalog context to table identifiers: catalog.database.table. Catalogs are defined in configuration and can share implementations, so you can have two catalogs using the JDBC source or two Hive metastores.

The Presto model would require a bit of work, but I think it is going to be necessary anyway before DSv2 is finished and really usable. We need to separate table creation from writes to ensure consistent behavior, while making it easier to implement the v2 write API. Otherwise, we will end up with whatever semantics each data source implementation chooses: overwrite might mean recreate the entire table, or might mean overwrite partitions.

rb


On Fri, Feb 2, 2018 at 4:11 PM, Michael Armbrust <[hidden email]> wrote:
I am definitely in favor of first-class / consistent support for tables and data sources.

One thing that is not clear to me from this proposal is exactly what the interfaces are between:
 - Spark
 - A (The?) metastore
 - A data source

If we pass in the table identifier is the data source then responsible for talking directly to the metastore? Is that what we want? (I'm not sure)

On Fri, Feb 2, 2018 at 10:39 AM, Ryan Blue <[hidden email]> wrote:

There are two main ways to load tables in Spark: by name (db.table) and by a path. Unfortunately, the integration for DataSourceV2 has no support for identifying tables by name.

I propose supporting the use of TableIdentifier, which is the standard way to pass around table names.

The reason I think we should do this is to easily support more ways of working with DataSourceV2 tables. SQL statements and parts of the DataFrameReader and DataFrameWriter APIs that use table names create UnresolvedRelation instances that wrap an unresolved TableIdentifier.

By adding support for passing TableIdentifier to a DataSourceV2Relation, then about all we need to enable these code paths is to add a resolution rule. For that rule, we could easily identify a default data source that handles named tables.

This is what we’re doing in our Spark build, and we have DataSourceV2 tables working great through SQL. (Part of this depends on the logical plan changes from my previous email to ensure inserts are properly resolved.)

In the long term, I think we should update how we parse tables so that TableIdentifier can contain a source in addition to a database/context and a table name. That would allow us to integration new sources fairly seamlessly, without needing to a rather redundant SQL create statement like this:

CREATE TABLE database.name USING source OPTIONS (table 'database.name')

Also, I think we should pass TableIdentifier to DataSourceV2Relation, rather than going with Wenchen’s suggestion that we pass the table name as a string property, “table”. My rationale is that the new API shouldn’t leak its internal details to other parts of the planner.

If we were to convert TableIdentifer to a “table” property wherever DataSourceV2Relation is created, we create several places that need to be in sync with the same convention. On the other hand, passing TableIdentifier to DataSourceV2Relation and relying on the relation to correctly set the options passed to readers and writers minimizes the number of places that conversion needs to happen.

rb

--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix