Terry: the idea is to always resolve identifiers the same way, rather than resolving to a temp view in some cases but not in others. If an identifier is a temp view and is used in a context where a view can’t be used, it should fail instead of finding a table.
Ryan: does this need to be done by 3.0?
Consensus was that it should be done for 3.0.
Ryan: not much activity on the dev list thread for this. Do we move forward anyway?
Wenchen: okay to fix because the scope is small
Consensus was to go ahead and notify the dev list about the changes, because this is a low-risk case that does not occur often (a table and a temp view with conflicting names).
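A minimal sketch of the rule Terry describes, for illustration only. This is not Spark's analyzer code: `resolve`, `AnalysisError`, and the `allow_views` flag are hypothetical names. It assumes temp views are always checked first and that a context which cannot accept a view fails rather than falling through to a table with the same name.

```python
class AnalysisError(Exception):
    """Raised when an identifier cannot be used in the requested context."""


def resolve(name, temp_views, tables, allow_views):
    """Resolve `name` the same way in every context (hypothetical helper).

    Temp views always take precedence. If the context cannot accept a view
    (for example, a write target), fail instead of finding a table.
    """
    if name in temp_views:
        if not allow_views:
            raise AnalysisError(f"{name} is a temp view; a table is required here")
        return ("temp_view", temp_views[name])
    if name in tables:
        return ("table", tables[name])
    raise AnalysisError(f"{name} not found")
```

With a temp view and a table both named `t`, a read resolves to the temp view, and a context that requires a table raises `AnalysisError` instead of silently picking the table.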
Burak: cached tables are similar: for insert you get the new results.
Ryan: is that the same issue or a similar problem to fix?
Burak: similar, it can be done separately
Ryan: does this also need to be fixed by 3.0?
Wenchen: Yes, it is a blocker. Spark should invalidate the cached table after a write.
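A toy illustration of "invalidate the cached table after a write". It uses a plain in-memory store, not Spark's cache manager; `CachedStore` and its methods are hypothetical names.

```python
class CachedStore:
    """In-memory stand-in for a table store with a read cache (hypothetical)."""

    def __init__(self):
        self._rows = {}   # table name -> rows in the underlying storage
        self._cache = {}  # table name -> cached snapshot of the rows

    def read(self, table):
        # Serve from the cache, populating it on first access.
        if table not in self._cache:
            self._cache[table] = list(self._rows.get(table, []))
        return self._cache[table]

    def insert(self, table, new_rows):
        self._rows.setdefault(table, []).extend(new_rows)
        # Drop the cached snapshot so the next read sees the new rows.
        self._cache.pop(table, None)
```

Without the `pop` in `insert`, a read after a write would keep returning the stale snapshot.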
Ryan: There’s another issue: how do we handle a permanent view with a name that resolves to a temp view? If incorrect, this changes the results of a stored view.
Wenchen: This is currently broken: Spark will resolve the relation as a temp view. But Spark could use the analysis context to fix this.
Ryan: We should fix this when fixing temp views.
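One way to read Wenchen's suggestion, sketched under assumptions: the analysis context records that the analyzer is expanding a stored (permanent) view, and relation lookup skips temp views in that case so the stored view's results do not change. `AnalysisContext`, `in_permanent_view`, and `resolve_relation` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisContext:
    # True while the analyzer is expanding the body of a permanent view.
    in_permanent_view: bool = False


def resolve_relation(name, temp_views, tables, ctx):
    """Look up `name`, ignoring temp views inside a permanent view body."""
    if not ctx.in_permanent_view and name in temp_views:
        return ("temp_view", name)
    if name in tables:
        return ("table", name)
    raise LookupError(f"{name} not found")
```

Here a session-level temp view named `t` shadows the table `t` in ordinary queries, but a permanent view that references `t` keeps resolving to the table.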
Nested schema pruning:
Dongjoon: Nested schema pruning was only done for Parquet and ORC instead of all v2 sources. Anton submitted a PR that fixes it.
At the time, the PR had +1s and was pending some minor discussion. It was merged the next day.
Wenchen: Spark always calls one method to load a table. The implementation can do schema and partition inference in that method. Forcing this to be separated into other methods causes problems in the file source, where FileIndex is used for all of these tasks.
Ryan: I’m not sure that existing file source code is a good enough justification to change the proposed API. Seems too path dependent.
Ryan: It is also strange to have the source of truth for schema information differ between code paths. Some getTable uses would pass the schema to the source (from metastore) with TableProvider, but some would instead rely on the table from getTable to provide its own schema. This is confusing to implementers.
Burak: The default mode in DataFrameWriter is ErrorIfExists, which doesn’t currently work with v2 sources. Moving from Kafka to KafkaV2, for example, would probably break.
Ryan: So do we want to get extractCatalog and extractIdentifier into 3.0? Or is this blocked by the infer changes?
Burak: It would be good to have.
Wenchen: Schema may be inferred, or provided by Spark.
Ryan: Sources should specify whether they accept a user-specified schema. But either way, the schema is still external and passed into the table. The main decision is whether all cases (inference included) should pass the schema into the table.
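A sketch of the option Ryan describes, in which the schema is always external and passed into the table, whether it was inferred or user-specified. This is not the TableProvider API itself: `ExampleProvider`, `accepts_user_schema`, `infer_schema`, `get_table`, and `load_table` are hypothetical names that only mirror the shape of the discussion.

```python
class ExampleProvider:
    """Hypothetical source that can infer a schema and accepts a user schema."""

    accepts_user_schema = True

    def infer_schema(self, options):
        # A real source would inspect files or an external system here.
        return [("id", "long"), ("data", "string")]

    def get_table(self, schema, options):
        # The table never decides its own schema; it receives one.
        return {"schema": schema, "options": options}


def load_table(provider, options, user_schema=None):
    """Pick the schema outside the table, then pass it in (single code path)."""
    if user_schema is not None:
        if not provider.accepts_user_schema:
            raise ValueError("source does not accept a user-specified schema")
        schema = user_schema
    else:
        schema = provider.infer_schema(options)
    return provider.get_table(schema, options)
```

The point of the single `get_table(schema, ...)` path is that the source of truth for the schema does not differ between code paths, which is the confusion Ryan raises for implementers.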
Tasks for 3.0
Decided to get temp view resolution fixed
Decided to get TableProvider changes in
extractCatalog/extractIdentifier are nice-to-have (but small)
Burak: Upgrading to v2 saveAsTable from DataFrameWriter v1 creates RTAS, but in Delta v1 would only overwrite the schema if requested. It would be nice to be able to select
Ryan: Standardizing behavior (replace vs truncate vs dynamic overwrite) is a main point of v2. Allowing sources to choose their own behavior is not supported in v2 so that we can guarantee consistent semantics across tables. Making a way for Delta to change its semantics doesn’t seem like a good idea. To have identical semantics, use Delta as a v1 source.
Spark 3.1 goals?
Ryan: I suggest metadata columns and metadata tables (e.g. partitions) for metadata query optimization
Wenchen: ViewCatalog and FunctionCatalog
Ryan: will send out a doc on FunctionCatalog when I can write it up
Ryan: also MERGE INTO. This will require returning a required distribution and sort order.