DSv2 sync notes - 11 December 2019

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

DSv2 sync notes - 11 December 2019

Ryan Blue

Hi everyone, here are my notes for the DSv2 sync last week. Sorry they’re late! Feel free to add more details or corrections. Thanks!

rb

Attendees:

Ryan Blue
John Zhuge
Dongjoon Hyun
Joseph Torres
Kevin Yu
Russel Spitzer
Terry Kim
Wenchen Fan
Hyukjin Kwan
Jacky Lee

Topics:

Discussion:

  • User-specified schema handling
    • Burak: User-specified schema should
  • Relation resolution behavior (see doc link above)
    • Ryan: Thanks to Terry for fixing table resolution: https://github.com/apache/spark/pull/26684, next step is to clean up temp views
    • Terry: the idea is to always resolve identifiers the same way and not to resolve a temp view in some cases but not others. If an identifier is a temp view and is used in a context where you can’t use a view, it should fail instead of finding a table.
    • Ryan: does this need to be done by 3.0?
    • Consensus was that it should be done for 3.0
    • Ryan: not much activity on the dev list thread for this. Do we move forward anyway?
    • Wenchen: okay to fix because the scope is small
    • Consensus was to go ahead and notify the dev list about changes because this is a low-risk case that does not occur often (table and temp view conflict)
    • Burak: cached tables are similar: for insert you get the new results.
    • Ryan: is that the same issue or a similar problem to fix?
    • Burak: similar, it can be done separately
    • Ryan: does this also need to be fixed by 3.0?
    • Wenchen: it is a blocker (Yes). Spark should invalidate the cached table after a write
    • Ryan: There’s another issue: how do we handle a permanent view with a name that resolves to a temp view? If incorrect, this changes the results of a stored view.
    • Wenchen: This is currently broken, Spark will resolve the relation as a temp view. But Spark could use the analysis context to fix this.
    • Ryan: We should fix this when fixing temp views.
  • Nested schema pruning:
    • Dongjoon: Nested schema pruning was only done for Parquet and ORC instead of all v2 sources. Anton submitted a PR that fixes it.
    • At the time, the PR had +1s and was pending some minor discussion. It was merged the next day.
  • TableProvider changes:

    • Wenchen: Spark always calls 1 method to load a table. The implementation can do schema and partition inference in that method. Forcing this to be separated into other methods causes problems in the file source. FileIndex is used for all these tasks.
    • Ryan: I’m not sure that existing file source code is a good enough justification to change the proposed API. Seems too path dependent.
    • Ryan: It is also strange to have the source of truth for schema information differ between code paths. Some getTable uses would pass the schema to the source (from metastore) with TableProvider, but some would instead rely on the table from getTable to provide its own schema. This is confusing to implementers.

    • Burak: The default mode in DataFrameWriter is ErrorIfExists, which doesn’t currently work with v2 sources. Moving from Kafka to KafkaV2, for example, would probably break.

    • Ryan: So do we want to get extractCatalog and extractIdentifier into 3.0? Or is this blocked by the infer changes?
    • Burak: It would be good to have.
    • Wenchen: Schema may be inferred, or provided by Spark
    • Ryan: Sources should specify whether they accept a user-specified schema. But either way, the schema is still external and passed into the table. The main decision is whether all cases (inference included) should pass the schema into the table.
  • Tasks for 3.0
    • Decided to get temp view resolution fixed
    • Decided to get TableProvider changes in
    • extractCatalog/extractIdentifier are nice-to-have (but small)
    • Burak: Upgrading to v2 saveAsTable from DataFrameWriter v1 creates RTAS, but in Delta v1 would only overwrite the schema if requested. It would be nice to be able to select
    • Ryan: Standardizing behavior (replace vs truncate vs dynamic overwrite) is a main point of v2. Allowing sources to choose their own behavior is not supported in v2 so that we can guarantee consistent semantics across tables. Making a way for Delta to change its semantics doesn’t seem like a good idea. To have identical semantics, use Delta as a v1 source.
  • Spark 3.1 goals?
    • Ryan: I suggest metadata columns and metadata tables (e.g. partitions) for metadata query optimization
    • Wenchen: ViewCatalog and FunctionCatalog
    • Ryan: will send out a doc on FunctionCatalog when I can write it up
    • Ryan: also MERGE INTO. this will require returning a required distribution and sort order.
--
Ryan Blue
Software Engineer
Netflix