DataSourceV2 sync notes - 24 July 2019

Here are my notes from the last DSv2 sync. Sorry it's a bit late!


Attendees:
Ryan Blue
John Zhuge
Raymond McCollum
Terry Kim
Gengliang Wang
Jose Torres
Wenchen Fan
Priyanka Gomatam
Matt Cheah
Russell Spitzer
Burak Yavuz



  • Blockers
    • Remove SaveMode from file sources: Blocked by the TableProvider/CatalogPlugin changes. The change doesn’t work with all of the USING clauses from v1, like JDBC. Working on a CatalogPlugin fix.
    • Reorganize packages: Blocked by outstanding INSERT INTO PRs
    • Docs: Ryan: docs can be written after branching, so focus should be on stability right now
    • Any other blockers? Please send them to Ryan to track
  • V2 session catalog config PR:
    • Wenchen: this will be included in CatalogPlugin changes
    • Matt: waiting for review
    • Burak: partitioning is strange, uses “Part 0” instead of names
    • Ryan: there are no names for transform partitions (identity partitions use column names)
    • Conclusion: not a big problem since there is no required schema; we can update later if better ideas come up
    • Ryan: ready for another review, DataFrameWriter.insertInto PR will follow
  • SupportsNamespaces PR:
    • Ryan: ready for another review
    • Terry: there are open questions: what is the current database for v2?
    • Ryan: there should be a current namespace in the SessionState. This could be per catalog?
    • Conclusion: do not track current namespace per catalog. Reset to a catalog default when current catalog changes
    • Ryan: will add a SupportsNamespaces method that returns the default namespace, used to initialize the current namespace.
    • Burak: USE could set both (the current catalog and the current namespace)
    • What if SupportsNamespaces is not implemented? Default to Seq.empty
    • Terry: should listing methods support search patterns?
    • Ryan: this adds complexity that should be handled by Spark instead of complicating the API. There isn’t a performance need to push this down because we don’t expect high cardinality for a namespace level.
    • Conclusion: implement in SHOW TABLES exec
    • Terry: how should temporary tables be handled?
    • Wenchen: temporary table is an alias for temporary view. SHOW TABLES does list temporary views, v2 should implement the same behavior.
    • Terry: support EXTENDED?
    • Ryan: This can be done later.
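
To make the two SupportsNamespaces conclusions concrete, here is a rough sketch in plain Java. Every type and method name in it (SketchCatalog, SketchSupportsNamespaces, defaultNamespace, SketchSessionState, SketchListingFilter) is made up for illustration and is not Spark's API. The empty String[] stands in for Seq.empty. The filter assumes the `*` wildcard and `|` alternation of SHOW TABLES patterns and treats every other character literally, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a v2 catalog; not Spark's CatalogPlugin.
interface SketchCatalog {
    String name();
}

// Hypothetical stand-in for SupportsNamespaces with the proposed
// default-namespace method. The method name is an assumption.
interface SketchSupportsNamespaces extends SketchCatalog {
    String[] defaultNamespace();
}

// Session-level state: one current catalog and one current namespace.
// Nothing is remembered per catalog.
final class SketchSessionState {
    private SketchCatalog currentCatalog;
    private String[] currentNamespace = new String[0];

    // Switching catalogs resets the namespace to the new catalog's default,
    // or to an empty namespace when the catalog has no namespace support.
    void setCurrentCatalog(SketchCatalog catalog) {
        currentCatalog = catalog;
        currentNamespace = (catalog instanceof SketchSupportsNamespaces)
            ? ((SketchSupportsNamespaces) catalog).defaultNamespace().clone()
            : new String[0];
    }

    void setCurrentNamespace(String[] namespace) {
        currentNamespace = namespace.clone();
    }

    SketchCatalog currentCatalog() { return currentCatalog; }
    String[] currentNamespace() { return currentNamespace.clone(); }
}

// Pattern matching done on the Spark side, after the catalog has returned
// its full listing, so the catalog API needs no search-pattern argument.
final class SketchListingFilter {
    // '*' matches any run of characters, '|' separates alternatives,
    // and matching is case-insensitive.
    static List<String> filter(List<String> names, String pattern) {
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            for (String sub : pattern.trim().split("\\|")) {
                String regex = "(?i)"
                    + Pattern.quote(sub.trim()).replace("*", "\\E.*\\Q");
                if (name.matches(regex)) {
                    matched.add(name);
                    break; // add each name at most once
                }
            }
        }
        return matched;
    }
}
```

With this shape, a USE-style command would call setCurrentCatalog first and then, if a namespace was given, setCurrentNamespace.
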
  • DELETE FROM:
    • Wenchen: DELETE FROM just passes filters to the data source to delete
    • Ryan: Instead of a complicated builder, let’s solve just the simple case (filters) and not the row-level delete case. If we do that, then we can use a simple SupportsDelete interface and put off row-level delete design
    • Consensus was to add a SupportsDelete interface for Table and not a new builder
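
As an illustration of the filter-only delete agreed on above, a minimal sketch of what such a mix-in could look like. The names SketchFilter, SketchEqualTo, SketchSupportsDelete, deleteWhere, and SketchInMemoryTable are placeholders chosen for this example, and the filter types are simplified stand-ins rather than Spark's filter classes. Row-level delete is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a pushed-down filter; Spark's own filter
// classes are not used so that the sketch stays self-contained.
interface SketchFilter {
    boolean matches(Map<String, Object> row);
}

// Simple equality filter on one column.
final class SketchEqualTo implements SketchFilter {
    private final String column;
    private final Object value;

    SketchEqualTo(String column, Object value) {
        this.column = column;
        this.value = value;
    }

    @Override
    public boolean matches(Map<String, Object> row) {
        return value.equals(row.get(column));
    }
}

// Mix-in for tables that can delete by filter, instead of a new builder.
// The method name is an assumption for this sketch.
interface SketchSupportsDelete {
    // Deletes every row that satisfies all of the filters (they are ANDed).
    void deleteWhere(SketchFilter[] filters);
}

// Toy table that keeps rows in memory.
final class SketchInMemoryTable implements SketchSupportsDelete {
    private final List<Map<String, Object>> rows = new ArrayList<>();

    void append(Map<String, Object> row) { rows.add(row); }

    int rowCount() { return rows.size(); }

    @Override
    public void deleteWhere(SketchFilter[] filters) {
        // A row is removed only if no filter rejects it. With an empty
        // filter array every row qualifies, so the table is emptied.
        rows.removeIf(row -> {
            for (SketchFilter filter : filters) {
                if (!filter.matches(row)) {
                    return false;
                }
            }
            return true;
        });
    }
}
```

A real source would more likely translate the filters into a partition or file-level delete, and could reject filters it cannot apply exactly. That behavior is not shown here.
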
  • Stats push-down fix:
    • Ryan: briefly looked into it and this can probably be done earlier, in the optimizer, by creating the scan early and adding a special logical plan node that wraps the scan. This isn’t a good long-term solution but would fix stats for the release. The write side would not change.
    • Ryan will submit a PR with the implementation
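
The wrapper idea for the stats fix might look roughly like the following. This is only a toy model, not the planned PR: SketchScan, SketchReportsStatistics, and SketchScanRelation are invented names, and falling back to a default size when a scan reports nothing is an assumption of the sketch.

```java
import java.util.OptionalLong;

// Hypothetical stand-in for a scan that has already been built,
// for example after filters were pushed to the source.
interface SketchScan {
    String description();
}

// Hypothetical mix-in for scans that can report a size estimate.
interface SketchReportsStatistics extends SketchScan {
    OptionalLong sizeInBytes();
}

// Logical-plan leaf that wraps a built scan, so that the optimizer sees
// statistics computed after push-down rather than whole-table statistics.
final class SketchScanRelation {
    private final SketchScan scan;
    private final long defaultSizeInBytes;

    SketchScanRelation(SketchScan scan, long defaultSizeInBytes) {
        this.scan = scan;
        this.defaultSizeInBytes = defaultSizeInBytes;
    }

    // Uses the scan's own estimate when it reports one; otherwise falls
    // back to the configured default.
    long estimatedSizeInBytes() {
        if (scan instanceof SketchReportsStatistics) {
            return ((SketchReportsStatistics) scan)
                .sizeInBytes()
                .orElse(defaultSizeInBytes);
        }
        return defaultSizeInBytes;
    }
}
```
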
  • Using ALTER TABLE implementations for v1
    • Burak: Took a stab at this, but ran into problems. Would be nice if all DDL for v1 were supported through v2 API
    • DDL doesn’t work with v1 for custom data sources when the source of truth is not Hive
    • Matt: v2 should be used to change the source of truth. v1 behavior is to only change the session catalog (e.g., Hive).
    • Matt: is v1 deprecated?
    • Wenchen: not until stable
    • Burak: can’t deprecate yet
    • Burak: CTAS and RTAS could also call v1
    • Ryan: We could build a v2 implementation that calls v1, but only append and read could be supported because v1 overwrite behavior is unreliable across sources.
  • Ran out of time
    • Wenchen’s CatalogPlugin changes can be discussed next time
    • Ryan will follow up with Raymond about reusing Parquet read path in other v2 sources
