DataSourceV2 sync notes - 29 May 2019

Ryan Blue

Here are my notes from last night’s sync. I had to leave early, so there may be more discussion. Others can fill in the details for those topics.

Attendees:

John Zhuge
Ryan Blue
Yifei Huang
Matt Cheah
Yuanjian Li
Russell Spitzer
Kevin Yu

Topics:

  • Atomic extensions for the TableCatalog API
  • Moving DSv2 to Catalyst - should this include package renames?
  • Catalogs and table resolution: proposal to prefer default v2 catalog when defined

Notes:

  • Skipping discussion of open PRs
  • Atomic table catalogs:
    • Matt: the proposal in the SPIP makes sense. When should Spark use the atomic API? Is there a way for a user to signal that Spark should use the staging calls? Spark could use SQL transaction statements for this.
    • Ryan: the atomic operations that we are currently targeting with the TableCatalog extensions are single statements, like CREATE TABLE AS SELECT. Transaction statements (e.g., BEGIN) are for multi-statement transactions and are out of scope.
    • Ryan: Because the expected behavior of the commands (CTAS, RTAS) is that they are atomic, Spark should always use atomic implementations if they are available. No need for a user to opt in.
    • Matt: What should REPLACE TABLE do if transactions are not supported? If the write fails, the table would be deleted.
    • Ryan: REPLACE is a combination of DROP TABLE and CREATE TABLE AS SELECT. By using it, the user is signaling that if a combined operation is possible, Spark should use it. So REPLACE TABLE signals intent to drop, and dropping the table is the right behavior if an atomic replace is not supported.
    • There was also some confusion about whether IF EXISTS should be supported. The consensus was that REPLACE TABLE AS SELECT is expected to be idempotent and should not fail if the target table does not exist.
  • Moving DSv2 to Catalyst - skipped because Wenchen did not attend
  • Catalogs and table resolution:
    • Ryan: Table resolution with catalogs is getting complicated when namespaces overlap. If an identifier has a catalog, then it is easy to use a v2 catalog. But when the identifier does not have a catalog, there is a namespace overlap between session catalog tables and the default v2 catalog tables. It would be much easier to understand and document if we used a simple rule for precedence. We suggest using the session catalog unless a default v2 catalog is defined, in which case the v2 catalog is used by default.
    • This makes the behavior easy to document and reason about, with few special cases. To guarantee compatibility, we will need a v2 implementation that delegates to the session catalog.
    • Ryan: If there aren’t objections, I’ll raise this on the dev list. We should make a decision there.
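
As an illustration of the two rules discussed in the notes above (use the atomic path for REPLACE TABLE AS SELECT when the catalog offers one, and prefer the default v2 catalog for identifiers without a catalog), here is a minimal Python sketch. It is a toy model under stated assumptions, not Spark's TableCatalog API: InMemoryCatalog, replace_table_as_select, and resolve are hypothetical names made up for this example, tables are plain lists of rows, and the string "session" stands in for the session catalog.

```python
class InMemoryCatalog:
    """Hypothetical toy catalog: maps a table name to a list of rows."""

    def __init__(self, name, atomic=False):
        self.name = name
        self.atomic = atomic  # True if a staged (atomic) replace is available
        self.tables = {}


def replace_table_as_select(catalog, table, query):
    """RTAS sketch: atomic when available, otherwise drop followed by CTAS.

    `query` is a zero-argument callable that returns the new rows and may raise.
    """
    if catalog.atomic:
        # Atomic path: run the write first and swap the table only on success,
        # so a failed write leaves the existing table untouched.
        rows = query()
        catalog.tables[table] = rows
    else:
        # Non-atomic fallback: DROP TABLE (no error if it is missing), then CTAS.
        # If the write raises, the table stays dropped.
        catalog.tables.pop(table, None)
        catalog.tables[table] = query()


def resolve(identifier, catalogs, default_v2=None):
    """Pick a catalog for a multi-part identifier given as a list of name parts.

    Returns (catalog name, remaining identifier parts).
    """
    head, *rest = identifier
    if rest and head in catalogs:
        # The identifier names a catalog explicitly.
        return head, rest
    if default_v2 is not None:
        # No catalog in the identifier, and a default v2 catalog is defined.
        return default_v2, list(identifier)
    # No catalog in the identifier and no default v2 catalog: session catalog.
    return "session", list(identifier)
```

In this sketch the caller never opts in to atomic behavior: the choice depends only on what the catalog supports. Likewise, resolve has no special cases beyond the single precedence rule, which is the property the proposal is after.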
--
Ryan Blue
Software Engineer
Netflix