Ryan Blue Terry Kim Wenchen Fan Jose Torres Jacky Lee Gengliang Wang
DROP NAMESPACE cascade behavior
TableProvider API changes
V1 and V2 table resolution rules
Separate logical and physical write (for streaming)
Bucketing support (if time)
DROP NAMESPACE cascade
Terry: How should the cascade option be handled?
Ryan: The API currently requires failing when the namespace is non-empty; the intent is for Spark to handle the complexity of recursive deletes
Wenchen: That will be slow because Spark has to list and issue individual delete calls.
Ryan: What about changing this so that DROP is always a recursive drop? Then Spark can check all implemented features (views for ViewCatalog, tables for TableCatalog) and we don’t need to add more calls and args.
Consensus was to update dropNamespace so that it is always cascading, so implementations can speed up the operation. Spark will check whether a namespace is empty and not issue the call if it is non-empty or the query was not cascading.
Remaining 3.0 tasks:
Add inferSchema and inferPartitioning to TableProvider (#26297)
Add catalog and identifier methods so that DataFrameWriter can support ErrorIfExists and Ignore modes
Wenchen: tables need both schema and partitioning. Sometimes these are provided but not always. Currently, they are inferred if not provided, but this is implicit based on whether they are passed.
Wenchen: A better API is to add inferSchema and inferPartitioning that are separate from getTable, so they are always explicitly passed to getTable.
Wenchen: the only problem is on the write path, where inference is not currently done for path-based tables. The PR has a special case to skip inference in this case.
Ryan: Sounds okay, will review soon.
Ryan: Why is inference so expensive?
Wenchen: No validation on write means extra validation is needed to read. All file schemas should be used to ensure compatibility. Partitioning is similar: more examples are needed to determine partition column types.
Ryan: we found that the v1 and v2 rules are order dependent. Wenchen has a PR, but it rewrites the v1 ResolveRelations rule. That’s concerning because we don’t want to risk breaking v1 in 3.0. So we need to find a work-around
Wenchen: Burak suggested a work-around that should be a good approach
Ryan: Agreed. And in the long term, I don’t think we want to mix view and table resolution. View resolution is complicated because it requires context (e.g., current db). But it shouldn’t be necessary to resolve tables at the same time. Identifiers can be rewritten to avoid this. We should also consider moving view resolution into an earlier batch. In that case, view resolution would happen in a fixed-point batch and it wouldn’t need the custom recursive code.
Ryan: Can permanent views resolve temporary views? If not, we can move temporary views sooner, which would help simplify the v2 resolution rules.
Separating logical and physical writes
Wenchen: there is a use case to add physical information to streaming writes, like parallelism. The way streaming is written, it makes sense to separate writes into logical and physical stages, like the read side with Scan and Batch.
Ryan: So this would create separate Write and Batch objects? Would this move epoch ID to the creation of a batch write?
Wenchen: maybe. Will write up a design doc. Goal is to get this into Spark 3.0 if possible
Ryan: Okay, but I think TableProvider is still high priority for the 3.0 work