DSv2 sync notes - 21 August 2019


DSv2 sync notes - 21 August 2019

Ryan Blue

Sorry these notes were delayed. Here’s what we talked about in the last DSv2 sync.

Attendees:

Ryan Blue
John Zhuge
Burak Yavuz
Gengliang Wang
Terry Kim
Wenchen Fan
Xin Ren
Srabasti Banerjee
Priyanka Gomatam

Topics:

Discussion:

  • Insert in DataFrameWriter v2 API:
    • Ryan: After reviewing the doc that Russel sent last time, it doesn’t look like there is precedent for insert implementing upsert without an additional clause like ON DUPLICATE KEY UPDATE or ON CONFLICT .... I think that means that insert should not be used for upsert, and it is correct to use the verb “append” in the new API.
    • Wenchen: Spark already supports INSERT OVERWRITE, which has no precedent other than Hive.
    • Ryan: Good point. INSERT OVERWRITE is a partition-level replace. If we think of single-key stores as partitioned by row key, then the dynamic INSERT OVERWRITE behavior is appropriate.
    • Ryan: One other reason to change “append” to “insert” is to match SQL. Should we consider renaming for consistency with SQL?
    • Burak: SQL inserts are by position and DataFrameWriterV2 appends are by name. DataFrameWriter (v1) uses position for insertInto, so there is a precedent that insert is by position.
    • Ryan: I agree with that logic. It makes more sense to use append to distinguish behavior.
    • Consensus was to keep the “append” verb, but we will revisit when Russel is back.
    • Burak: (Continuing from the DataFrameWriterV2 discussion) The v2 writer looks fine, other than that the partition function names are close to the built-in expression function names (year vs. years).
    • Consensus was to use “partitioning.years” for partitioning functions.
  • Changes to CatalogPlugin for v2 session catalog implementations
    • Wenchen: This adds a new config for overriding the v2 session catalog, and a new abstract class that must be implemented.
    • Ryan: Why a new config? If we intend for a user to be able to override this, then we already have a mechanism to configure it using the “session” catalog name.
    • Discussion of the pros and cons of using a separate config; consensus was to use the existing CatalogPlugin config.
    • Ryan: Looks like this uses TableCatalog for the actual API and passes in the built-in V2SessionCatalog. That sounds like a good idea to me, instead of introducing a new API.
    • Burak: What about databases named “session”?
    • Ryan: Catalogs take precedence over databases, so the session catalog will be used for “session.table”.
    • Burak: Sounds like this is going to break existing queries then.
    • Ryan: I think the general rule that catalogs should take precedence is right. It would be worse to allow users creating databases to break other users’ catalogs — we avoid that problem with catalogs because they are limited to jobs when users create them and are otherwise an administrator option. But, session is a special case because this is Spark building a catalog into all environments… I think that’s a good reason to name it something that we think is unlikely to conflict.
    • Discussion came up with several alternatives (session, built_in, etc.), but consensus settled on “spark_catalog”. That’s more descriptive and much less likely to conflict with existing databases.
  • Remove SaveMode: this was done by Wenchen’s commit that disabled file sources.
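
Since the group settled on reusing the existing CatalogPlugin config mechanism together with the "spark_catalog" name, overriding the built-in session catalog would look roughly like the fragment below. The implementation class name is a placeholder, not something from these notes; per the discussion, the plugged-in class implements TableCatalog and is handed the built-in V2SessionCatalog to delegate to.

```properties
# Sketch: overriding the v2 session catalog via the existing
# spark.sql.catalog.<name> mechanism, using the agreed "spark_catalog" name.
# com.example.MySessionCatalog is a hypothetical TableCatalog implementation.
spark.sql.catalog.spark_catalog=com.example.MySessionCatalog
```
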
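
The by-name vs. by-position distinction Burak raised can be illustrated without Spark at all. The sketch below is plain Python showing only the column-matching rules discussed above (it is not the DataFrameWriter implementation): v1-style insertInto resolves incoming values to table columns by position, while a v2-style append resolves them by name.

```python
# Plain-Python illustration of the matching semantics discussed above.
# Not Spark code; the table and helpers are hypothetical.

table_columns = ["id", "ts", "value"]

def insert_by_position(row_values):
    """v1 insertInto-style: the i-th value goes to the i-th table column."""
    return dict(zip(table_columns, row_values))

def append_by_name(row_dict):
    """v2 append-style: values are matched to table columns by name,
    regardless of the order in which they were provided."""
    return {col: row_dict[col] for col in table_columns}

# By position, reordering the incoming values silently changes meaning;
# by name, field order in the input does not matter.
```

This is why keeping distinct verbs is useful: the verb signals which resolution rule applies.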
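
Ryan's description of dynamic INSERT OVERWRITE as a partition-level replace can also be sketched in plain Python. This is an illustration of the semantics only, not Spark code: partitions present in the incoming data are replaced wholesale, and all other partitions survive.

```python
# Plain-Python sketch of dynamic partition overwrite semantics.
# A "table" here is just a list of row dicts; partition_key names the
# partition column. Hypothetical helper, for illustration only.

def dynamic_overwrite(table, new_rows, partition_key):
    """Replace every partition touched by new_rows; keep the rest."""
    touched = {row[partition_key] for row in new_rows}
    kept = [row for row in table if row[partition_key] not in touched]
    return kept + new_rows
```

Under the "single-key store partitioned by row key" framing from the notes, each partition holds one row, so this same rule degenerates to per-key replacement.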
--
Ryan Blue
Software Engineer
Netflix

Re: DSv2 sync notes - 21 August 2019

RussS
Sorry all, I'm out on vacation till the week after next, I'll chime in then unless we need consensus sooner

On Fri, Aug 30, 2019, 4:05 PM Ryan Blue <[hidden email]> wrote:
