Here are my notes from last night’s sync. I had to leave early, so there may have been more discussion after I left. Others can fill in the details for those topics.
Attendees: John Zhuge, Ryan Blue, Yifei Huang, Matt Cheah, Yuanjian Li, Russell Spitzer, Kevin Yu
Atomic extensions for the TableCatalog API
Moving DSv2 to Catalyst - should this include package renames?
Catalogs and table resolution: proposal to prefer default v2 catalog when defined
Skipping discussion of open PRs
Atomic table catalogs:
Matt: the proposal in the SPIP makes sense. When should Spark use the atomic API? Is there a way for a user to signal that Spark should use the staging calls? Spark could use SQL transaction statements for this.
Ryan: the atomic operations that we are currently targeting with the TableCatalog extensions are single statements, like CREATE TABLE AS SELECT. Transaction statements (e.g., BEGIN) are for multi-statement transactions and are out of scope.
Ryan: Because the expected behavior of these commands (CTAS, RTAS) is that they are atomic, Spark should always use atomic implementations if they are available. No need for a user to opt in.
Matt: What should REPLACE TABLE do if transactions are not supported? If the write fails, the table would be deleted.
Ryan: REPLACE is a combination of DROP TABLE and CREATE TABLE AS SELECT. By using it, the user is signaling that if a combined operation is possible, Spark should use it. So REPLACE TABLE signals intent to drop, and dropping the table is the right behavior if an atomic replace is not supported.
There was also some confusion about whether IF EXISTS should be supported. The consensus was that REPLACE TABLE AS SELECT is expected to be idempotent and should not fail if the target table does not exist.
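To make the two outcomes concrete, here is a minimal sketch of the behavior discussed above. It is a toy in-memory model, not Spark's TableCatalog API: the class names, `stage_replace`, and `replace_table_as_select` are all hypothetical and exist only for illustration. It assumes the rule from the discussion: use the atomic path whenever the catalog offers one, otherwise fall back to drop followed by create.

```python
class InMemoryCatalog:
    """Hypothetical catalog without staging support (not Spark's API)."""

    def __init__(self):
        self.tables = {}

    def drop_table(self, name):
        # Dropping a missing table is a no-op, so a replace never fails
        # just because the target does not exist yet.
        self.tables.pop(name, None)

    def create_table(self, name, rows):
        self.tables[name] = list(rows)


class StagedTable:
    """Hypothetical staged replacement that is invisible until commit."""

    def __init__(self, catalog, name):
        self.catalog = catalog
        self.name = name
        self.rows = []

    def write(self, rows):
        for row in rows:
            self.rows.append(row)

    def commit(self):
        # The swap is the only step that touches the visible table.
        self.catalog.tables[self.name] = self.rows


class StagingCatalog(InMemoryCatalog):
    """Hypothetical catalog that can stage a replacement atomically."""

    def stage_replace(self, name):
        return StagedTable(self, name)


def replace_table_as_select(catalog, name, query):
    """Replace `name` with the rows produced by `query`.

    `query` is a zero-argument callable returning an iterable of rows;
    it may raise to simulate a failed write.
    """
    if hasattr(catalog, "stage_replace"):
        # Atomic path: always used when available, no user opt-in.
        staged = catalog.stage_replace(name)
        staged.write(query())  # a failure here leaves the old table intact
        staged.commit()
    else:
        # Fallback: the user asked for a replace, so dropping is intended.
        catalog.drop_table(name)
        catalog.create_table(name, query())  # a failure here leaves no table
```

With the staging catalog, a query that fails midway leaves the original table untouched. With the plain catalog, the same failure leaves the table dropped, which matches the intent described for REPLACE when no atomic implementation exists.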
Moving DSv2 to Catalyst - skipped because Wenchen did not attend
Catalogs and table resolution:
Ryan: Table resolution with catalogs is getting complicated when namespaces overlap. If an identifier has a catalog, then it is easy to use a v2 catalog. But when the identifier does not have a catalog, there is a namespace overlap between session catalog tables and the default v2 catalog tables. It would be much easier to understand and document if we used a simple rule for precedence. We suggest using the session catalog unless a default v2 catalog is defined, in which case the v2 catalog is used by default.
This makes the behavior easy to document and reason about, with few special cases. To guarantee compatibility, we will need a v2 implementation that delegates to the session catalog.
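The proposed precedence rule can be sketched in a few lines. This is an illustration only, not Spark's resolution code: the function `resolve_catalog`, its parameters, and the `"session"` label are hypothetical names chosen for the example, and it assumes identifiers arrive as a list of name parts.

```python
def resolve_catalog(name_parts, catalogs, default_v2_catalog=None):
    """Pick a catalog for a multi-part identifier.

    name_parts: list of identifier parts, e.g. ["prod", "db", "t"]
    catalogs: set of registered v2 catalog names
    default_v2_catalog: name of the default v2 catalog, or None if unset

    Returns (catalog_name, remaining_identifier_parts).
    """
    # An explicit catalog prefix always wins.
    if len(name_parts) > 1 and name_parts[0] in catalogs:
        return name_parts[0], name_parts[1:]
    # No catalog in the identifier: prefer the default v2 catalog if defined.
    if default_v2_catalog is not None:
        return default_v2_catalog, name_parts
    # Otherwise fall back to the session catalog.
    return "session", name_parts


# Example usage with a single registered v2 catalog named "prod":
explicit = resolve_catalog(["prod", "db", "t"], {"prod"})
no_default = resolve_catalog(["db", "t"], {"prod"})
with_default = resolve_catalog(["db", "t"], {"prod"}, default_v2_catalog="prod")
```

Under this rule, the only question for an identifier without a catalog is whether a default v2 catalog is configured, which is what keeps the behavior easy to document.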
Ryan: If there aren’t objections, I’ll raise this on the dev list. We should make a decision there.