Make DataFrameWriter compatible when updating a source from v1 to v2 (Burak)
Spark 2.5 release?
Ryan: We are very close to feature-complete for a release, but 3.0 is still a few months away. I’m going to be backporting DSv2 onto Spark 2.4 and could contribute that work for a 2.5 release. Lots of people have been asking when DSv2 will be available, and since it isn’t a breaking change we don’t need to wait until 3.0.
Wenchen: It would be good to have a 2.x release with the same DSv2 support as 3.0 so source authors only need to support one DSv2 API. The API has changed substantially since 2.4.
Ryan: Good point about compatibility, that would make using DSv2 much easier.
Holden: It would also be helpful to support Java 11 for the same reason.
Ryan: I think it makes sense to do a compatibility release in preparation for 3.0 then. I’ll bring this up on the dev list.
DataFrameWriterV2 Python API:
Ryan: I opened SPARK-29157 to add support for the new DataFrameWriterV2 API in Python. This is probably a good starter issue if anyone wants to work on it.
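For context, this is roughly how the Scala DataFrameWriterV2 API is used; the Python binding tracked by SPARK-29157 would mirror these calls. The catalog and table names below are made up for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().getOrCreate()

    // "demo_catalog.db.events" is a hypothetical table in a v2 catalog.
    val df = spark.range(0, 100).withColumnRenamed("id", "value")

    df.writeTo("demo_catalog.db.events")
      .partitionedBy(col("value"))
      .create()                      // CTAS through the table's v2 catalog

    // Subsequent writes reuse the same entry point.
    spark.range(100, 200).withColumnRenamed("id", "value")
      .writeTo("demo_catalog.db.events")
      .append()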
Catalog and table resolution refactor:
Wenchen: This refactor cleans up the default/current catalog handling and also separates catalog resolution from table resolution.
Ryan: One of my main concerns is that conversion happens not in rules, but in the ParsedStatement plans themselves.
Wenchen: The PR has been updated; that is no longer the case, and it now uses extractors.
Ryan: I will take another look, then. Also, I’d like to keep the rules in Analyzer specific to DSv2 and keep the v1 fallback rules in DataSourceResolution. That way, fallback is a special case and we can remove it without rewriting the v2 rules.
Wenchen: That would also allow us to keep the v2 rules in catalyst instead of in sql-core. I will make the change.
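A rough sketch of the shape being discussed, with hypothetical plan and extractor names (CreateTableStatement, CatalogAndIdentifier, CreateV2Table) standing in for whatever the PR actually uses. The point is that an extractor keeps the catalog lookup inside an analyzer rule, so the ParsedStatement node stays a plain container.

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Sketch only: the plan and extractor names are illustrative.
    object ResolveV2CreateTable extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
        // The extractor turns the raw multipart name into a (catalog, identifier)
        // pair here, in the rule, instead of inside the ParsedStatement itself.
        case CreateTableStatement(CatalogAndIdentifier(catalog, ident), schema, partitioning, props) =>
          CreateV2Table(catalog, ident, schema, partitioning, props, ignoreIfExists = false)
      }
    }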
Partitioning in DataFrameWriter.save:
Burak: DataFrameWriter.save ends up with a different default SaveMode when an implementation moves from v1 to v2, which changes behavior. The idea behind these PRs is to allow a TableProvider implementation to provide create/exists/drop.
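A concrete instance of the behavior change, assuming a source that has moved from v1 to v2 ("com.example.v2source" is a made-up format name):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.range(10).toDF("value")

    // DataFrameWriter's default mode is ErrorIfExists. Against a v1 source this
    // fails when the target already exists; once the source moves to v2 there is
    // no catalog to answer "does it exist?", so the same call either takes on
    // append semantics or fails because ErrorIfExists isn't supported.
    df.write
      .format("com.example.v2source")
      .option("table", "events")
      .save()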
Ryan: This line of reasoning is why we introduced the TableCatalog API: to implement CTAS, we need to create a table; but to make this reliable we need to know whether the table already exists; then to handle failure, we need to be able to drop the table if the write fails. The progression of PRs mirrors this: the first adds create and exists, and the second adds drop.
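A sketch of why CTAS needs all three operations, written against the TableCatalog methods (tableExists, createTable, dropTable); packages and signatures are approximate and the actual write is elided.

    import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType

    def ctas(catalog: TableCatalog, ident: Identifier, schema: StructType): Unit = {
      // 1. exists: CTAS must fail if the table is already there.
      if (catalog.tableExists(ident)) {
        throw new IllegalStateException(s"Table already exists: $ident")
      }
      // 2. create: the table has to exist before the write can run.
      val table = catalog.createTable(
        ident, schema, Array.empty[Transform], java.util.Collections.emptyMap[String, String]())
      try {
        // ... run the write against `table` ...
      } catch {
        case e: Exception =>
          // 3. drop: roll back the created table if the write fails.
          catalog.dropTable(ident)
          throw e
      }
    }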
Ryan: I think it is helpful to re-frame the problem. The issue is that a read/write implementation loads a specific class, which doesn’t have an associated catalog because there was only one catalog when this was designed. The read/write implementation class alone can’t determine the right catalog to use. To go from a read/write implementation to a catalog, we have three options:
1. Don’t allow operations that require a catalog, i.e. no CTAS or ErrorIfExists mode. This is what we currently do, but moving from v1 to v2 means the default changes to append, or fails because ErrorIfExists isn’t supported.
2. Always use spark_catalog, because that’s what has always been used before.
3. Allow the source to extract the catalog from the DataFrameReader/DataFrameWriter options.
Matt: I like the third option, where the implementation can extract a catalog from the read/write options.
Burak: That option would allow reusing all of the same DSv2 plans: use the options to get a catalog and a table identifier.
Consensus was to go with the third option: add extractCatalogName(Options) and extractIdentifier(Options).
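A sketch of what such a mix-in for TableProvider might look like, using the method names from the discussion; the interface and signatures Spark ends up with may differ.

    import org.apache.spark.sql.connector.catalog.{Identifier, TableProvider}
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Sketch only: names follow the discussion above, not a final API.
    trait SupportsCatalogExtraction extends TableProvider {
      // Name of the catalog plugin to load, derived from the
      // DataFrameReader/DataFrameWriter options.
      def extractCatalogName(options: CaseInsensitiveStringMap): String

      // Identifier of the table within that catalog, also derived from the options.
      def extractIdentifier(options: CaseInsensitiveStringMap): Identifier
    }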
There was also some discussion about path-based tables, but these are out of scope because their behavior is not yet defined.
Ryan: Going with option 3 still requires Wenchen’s changes to TableProvider to pass all of the table information. The DataFrame API will get a catalog and identifier pair, but instantiating a Table in a generic metastore still requires the buildTable method.
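For reference, a rough sketch of the kind of TableProvider change being referred to, where the provider receives the complete table definition rather than just options; the method name follows the discussion and the signature is not final.

    import org.apache.spark.sql.connector.catalog.Table
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType

    // Sketch only: a provider hook that takes the full table metadata
    // (schema, partitioning, properties), so a generic metastore-backed
    // catalog can hand back a Table without re-inferring anything.
    trait TableBuildingProvider {
      def buildTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: java.util.Map[String, String]): Table
    }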