DataSourceV2 community sync notes - 1 May 2019

Ryan Blue

Here are my notes for the latest DSv2 community sync. As usual, if you have comments or corrections, please reply. If you’d like to be invited to the next sync, email me directly. Everyone is welcome to attend.

Attendees:
Ryan Blue
John Zhuge
Andrew Long
Bruce Robbins
Dilip Biswal
Gengliang Wang
Kevin Yu
Michael Artz
Russell Spitzer
Yifei Huang
Zhilmil Dhillon

Topics:

Introductions
Open pull requests
V2 organization
Bucketing and sort order from v2 sources

Discussion:

  • Introductions: we stopped doing introductions when the group grew large. Attendance has dropped since the first few syncs, so we resumed them.
  • V2 organization / PR #24416: https://github.com/apache/spark/pull/24416
    • Ryan: There’s an open PR to move v2 into catalyst
    • Andrew: What is the distinction between catalyst and sql? How do we know what goes where?
    • Ryan: IIUC, the catalyst module is supposed to be a stand-alone query planner that doesn’t get into concrete physical plans. The catalyst package is the private implementation. Anything that is generic catalyst, including APIs like DataType, should be in the catalyst module. Anything public, like an API, should not be in the catalyst package. (See the illustration after this item.)
    • Ryan: v2 meets those requirements now and I don’t have a strong opinion on organization. We just need to choose one.
    • No one had a strong opinion, so we tabled this; we’ll decide on an organization in #24416 or shortly after and do the move all at once.
    • Next steps: someone with an opinion on organization should suggest a structure.
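    For readers unfamiliar with the module/package distinction above, here is a minimal illustration of the rule Ryan described. The specific trait and rule names are hypothetical, not a proposal for the final layout:

        // Public v2 API: may live in the catalyst *module*, but stays
        // outside the private org.apache.spark.sql.catalyst *package*.
        package org.apache.spark.sql.sources.v2 {
          trait Table {
            def name(): String
          }
        }

        // Private implementation: analysis rules belong inside the
        // catalyst package. (ResolveV2Tables is a made-up rule name.)
        package org.apache.spark.sql.catalyst.analysis {
          object ResolveV2Tables
        }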
  • TableCatalog API / PR #24246: https://github.com/apache/spark/pull/24246
    • Ryan: Wenchen’s last comment was that he was waiting for tests to pass, and they are now passing. Maybe this will be merged soon? (A condensed sketch of the API follows below.)
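    For context, a condensed sketch of the catalog surface proposed in #24246 is below. The signatures are abridged (the PR also covers table alteration and partitioning arguments); the PR itself is the authoritative reference:

        import java.util.{Map => JMap}
        import org.apache.spark.sql.types.StructType

        // Simplified stand-ins for the types defined in the PR.
        trait Identifier {
          def namespace(): Array[String]
          def name(): String
        }

        trait Table

        trait TableCatalog {
          def loadTable(ident: Identifier): Table
          def createTable(
              ident: Identifier,
              schema: StructType,
              properties: JMap[String, String]): Table
          def dropTable(ident: Identifier): Boolean
        }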
  • Bucketing and sort order from v2 sources
    • Andrew: interested in using a source’s existing data layout and sort order to avoid expensive steps, like shuffles and sorts, in joins
    • Ryan: in v2, bucketing is unified with other partitioning functions. I plan to build a way for Spark to get partition function implementations from a source so it can use that function to prepare the other side of a join. From there, I have been thinking about a way to check compatibility between functions, so we could validate that table A has the same bucketing as table B. (A sketch of this follows the list.)
    • Dilip: is bucketing Hive-specific?
    • Russell: Cassandra also buckets
    • Matt: what is the difference between bucketing and other partition functions for this?
    • Ryan: probably no difference. If you’re partitioning by hour, you could probably use that, too.
    • Dilip: how can Spark compare functions?
    • Andrew: should we introduce a standard bucketing function? switching would be easy for their use case
    • Ryan: it is difficult to introduce a standard because so much data already exists in tables. I think it is easier to support multiple functions.
    • Russell: Cassandra uses dynamic bucketing, which couldn’t conform to a standard function
    • Dilip: sources could push down joins
    • Russell: that’s a harder problem
    • Andrew: does anyone else limit bucket size?
    • Ryan: we don’t, because we assume the sort can spill. Probably a good optimization for later
    • Matt: what are the follow-up items for this?
    • Andrew: will look into the current state of bucketing in Spark
    • Ryan: it would be great if someone thought about what the FunctionCatalog interface should look like (a rough sketch follows below)
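    To seed that follow-up, here is a purely hypothetical sketch of what a FunctionCatalog and the compatibility check discussed above could look like. None of these interfaces existed in Spark at the time of this sync, and every name below is made up:

        // Hypothetical: a partition function a source hands to Spark.
        trait BoundFunction {
          def name(): String
          def canonicalName(): String  // identity compared across sources
          def numBuckets: Option[Int]  // None for transforms like hour()
        }

        // Hypothetical: how Spark could look up a source's functions.
        trait FunctionCatalog {
          def loadFunction(name: String): BoundFunction
        }

        object BucketCompatibility {
          // Two tables can join without a shuffle only if both sides were
          // written with the same canonical function and bucket count.
          def compatible(a: BoundFunction, b: BoundFunction): Boolean =
            a.canonicalName() == b.canonicalName() &&
              a.numBuckets == b.numBuckets
        }

    Russell’s dynamic-bucketing case would show up here as a function whose canonical identity is source-specific, which is exactly why mandating a single standard function is hard.
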
--
Ryan Blue
Software Engineer
Netflix