DataSourceV2 sync notes - 15 May 2019

Ryan Blue

Sorry these notes are so late; I didn’t get to the write-up until now. As usual, if anyone has corrections or comments, please reply.

Attendees:

John Zhuge
Ryan Blue
Andrew Long
Wenchen Fan
Gengliang Wang
Russell Spitzer
Yuanjian Li
Yifei Huang
Matt Cheah
Amardeep Singh Dhilon
Zhilmil Dhion
Ryan Pifer

Topics:

Discussion:

  • Wenchen: When will we add select support?
    • John: working on resolution. DSv2 resolution is straightforward; the difficulty is ensuring a smooth transition from v1 to v2.
    • Ryan: table resolution will also be used for inserts. Once select is done, insert is next.
    • John: the PR may include insert as well
  • Add default v2 catalog:
    • Ryan: A default catalog is needed for CTAS support when the source is v2
    • Ryan: A pass-through v2 catalog that uses SessionCatalog should be available as the default
  • FunctionCatalog API:
    • Wenchen: this should have a design doc
    • Ryan: Agreed. The PR is for early discussion and prototyping.
  • Bucketed joins: [Ed: I don’t remember much of this, feel free to expand what was said]
    • Andrew: looks like there is lots of work to be done for bucketing. Sort removals aren’t done, and bucketing with non-bucketed tables still incurs hashing costs.
    • Ryan: work on support for Hive bucketing appears to have stopped, so it doesn’t look like this is an easy area to improve
    • Where should join optimization be done?
    • Andrew will create a prototype PR.
  • Case sensitivity in catalogs: should catalogs report case sensitivity to Spark?
    • Ryan: catalogs connect to external systems, so Spark can’t impose case sensitivity requirements. A catalog is either case sensitive or not, and a requirement from Spark would only force some catalogs to violate Spark’s assumption.
    • Ryan: requiring a catalog to report whether it is case sensitive doesn’t actually help Spark. If the catalog is case sensitive, then Spark should pass exactly what it received to avoid changing the meaning. If the catalog is case insensitive, then Spark can pass exactly what it received because case is handled in the catalog. So Spark’s behavior doesn’t change.
    • Russell: not all catalogs are case sensitive or case insensitive. Some are case insensitive unless an identifier is quoted. Quoted parts are case sensitive.
    • Ryan: So a catalog would not be able to return true or false correctly.
    • Conclusion: Spark should pass identifiers that it received, without modification.
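
The case-sensitivity conclusion above can be illustrated with a small sketch. This is not Spark code: the three toy catalog classes, their `lookup` method, and the `quoted` flag are hypothetical names invented for this example. It only shows why passing identifiers through unchanged is safe for every kind of catalog discussed, while normalizing case before the call is not.

```python
# Toy model only: these classes are hypothetical and are not Spark APIs.

class CaseSensitiveCatalog:
    """Matches table names exactly as given."""
    def __init__(self, tables):
        self.tables = set(tables)

    def lookup(self, name, quoted=False):
        return name if name in self.tables else None


class CaseInsensitiveCatalog:
    """Ignores case for every identifier."""
    def __init__(self, tables):
        self.folded = {t.lower(): t for t in tables}

    def lookup(self, name, quoted=False):
        return self.folded.get(name.lower())


class QuotedSensitiveCatalog:
    """Case insensitive unless the identifier was quoted (Russell's case)."""
    def __init__(self, tables):
        self.exact = set(tables)
        self.folded = {t.lower(): t for t in tables}

    def lookup(self, name, quoted=False):
        if quoted:
            return name if name in self.exact else None
        return self.folded.get(name.lower())


def pass_through(catalog, name, quoted=False):
    """The conclusion from the sync: hand over the identifier unmodified."""
    return catalog.lookup(name, quoted)


def normalize_first(catalog, name, quoted=False):
    """The alternative: lowercase before calling the catalog."""
    return catalog.lookup(name.lower(), quoted)


tables = ["Events"]
sensitive = CaseSensitiveCatalog(tables)
insensitive = CaseInsensitiveCatalog(tables)
mixed = QuotedSensitiveCatalog(tables)

# Passing the identifier through works for all three catalogs.
found_pass_through = [
    pass_through(sensitive, "Events"),
    pass_through(insensitive, "Events"),
    pass_through(mixed, "Events", quoted=True),
]

# Lowercasing first loses the table wherever case carries meaning.
found_normalized = [
    normalize_first(sensitive, "Events"),
    normalize_first(insensitive, "Events"),
    normalize_first(mixed, "Events", quoted=True),
]
```

In this toy model `found_pass_through` finds the table in all three catalogs, while `found_normalized` misses it in the case-sensitive catalog and in the quoted lookup. The third class is also why a single true/false flag would not describe every catalog: its answer depends on how each identifier was written.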
--
Ryan Blue
Software Engineer
Netflix