DSv2 sync notes - 30 October 2019

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

DSv2 sync notes - 30 October 2019

Ryan Blue

Attendees:

Ryan Blue
Terry Kim
Wenchen Fan
Jose Torres
Jacky Lee
Gengliang Wang

Topics:

  • DROP NAMESPACE cascade behavior
  • 3.0 tasks
  • TableProvider API changes
  • V1 and V2 table resolution rules
  • Separate logical and physical write (for streaming)
  • Bucketing support (if time)
  • Open PRs

Discussion:

  • DROP NAMESPACE cascade
    • Terry: How should the cascade option be handled?
    • Ryan: The API currently requires failing when the namespace is non-empty; the intent is for Spark to handle the complexity of recursive deletes
    • Wenchen: That will be slow because Spark has to list and issue individual delete calls.
    • Ryan: What about changing this so that DROP is always a recursive drop? Then Spark can check all implemented features (views for ViewCatalog, tables for TableCatalog) and we don’t need to add more calls and args.
    • Consensus was to update dropNamespace so that it is always cascading, so implementations can speed up the operation. Spark will check whether a namespace is empty and not issue the call if it is non-empty or the query was not cascading.
  • Remaining 3.0 tasks:
    • Add inferSchema and inferPartitioning to TableProvider (#26297)
    • Add catalog and identifier methods so that DataFrameWriter can support ErrorIfExists and Ignore modes
  • TableProvider changes:
    • Wenchen: tables need both schema and partitioning. Sometimes these are provided but not always. Currently, they are inferred if not provided, but this is implicit based on whether they are passed.
    • Wenchen: A better API is to add inferSchema and inferPartitioning that are separate from getTable, so they are always explicitly passed to getTable.
    • Wenchen: the only problem is on the write path, where inference is not currently done for path-based tables. The PR has a special case to skip inference in this case.
    • Ryan: Sounds okay, will review soon.
    • Ryan: Why is inference so expensive?
    • Wenchen: No validation on write means extra validation is needed to read. All file schemas should be used to ensure compatibility. Partitioning is similar: more examples are needed to determine partition column types.
  • Resolution rules
    • Ryan: we found that the v1 and v2 rules are order dependent. Wenchen has a PR, but it rewrites the v1 ResolveRelations rule. That’s concerning because we don’t want to risk breaking v1 in 3.0. So we need to find a work-around
    • Wenchen: Burak suggested a work-around that should be a good approach
    • Ryan: Agreed. And in the long term, I don’t think we want to mix view and table resolution. View resolution is complicated because it requires context (e.g., current db). But it shouldn’t be necessary to resolve tables at the same time. Identifiers can be rewritten to avoid this. We should also consider moving view resolution into an earlier batch. In that case, view resolution would happen in a fixed-point batch and it wouldn’t need the custom recursive code.
    • Ryan: Can permanent views resolve temporary views? If not, we can move temporary views sooner, which would help simplify the v2 resolution rules.
  • Separating logical and physical writes
    • Wenchen: there is a use case to add physical information to streaming writes, like parallelism. The way streaming is written, it makes sense to separate writes into logical and physical stages, like the read side with Scan and Batch.
    • Ryan: So this would create separate Write and Batch objects? Would this move epoch ID to the creation of a batch write?
    • Wenchen: maybe. Will write up a design doc. Goal is to get this into Spark 3.0 if possible
    • Ryan: Okay, but I think TableProvider is still high priority for the 3.0 work
    • Wenchen: Agreed.
--
Ryan Blue
Software Engineer
Netflix