Here are my notes for the latest DSv2 community sync. As usual, if you have comments or corrections, please reply. If you’d like to be invited to the next sync, email me directly. Everyone is welcome to attend.
Attendees: Ryan Blue John Zhuge Andrew Long Bruce Robbins Dilip Biswal Gengliang Wang Kevin Yu Michael Artz Russel Spitzer Yifei Huang Zhilmil Dhillon
Introductions Open pull requests V2 organization Bucketing and sort order from v2 sources
Introductions: we stopped doing introductions when we had a large group. Attendance has gone down from the first few syncs, so we decided to resume.
Andrew: What is the distinction between catalyst and sql? How do we know what goes where?
Ryan: IIUC, the catalyst module is supposed to be a stand-alone query planner that doesn’t get into concrete physical plans. The catalyst package is the private implementation. Anything that is generic catalyst, including APIs like DataType, should be in the catalyst module. Anything public, like an API, should not be in the catalyst package.
Ryan: v2 meets those requirements now and I don’t have a strong opinion on organization. We just need to choose one.
No one had a strong opinion so we tabled this. In #24416 or shortly after let’s decide on organization and do the move at once.
Next steps: someone with an opinion on organization should suggest a structure.
Ryan: Wenchen’s last comment was that he was waiting for tests to pass and they are. Maybe this will be merged soon?
Bucketing and sort order from v2 sources
Andrew: interested in using existing data layout and sorting to remove expensive tasks in joins
Ryan: in v2, bucketing is unified with other partitioning functions. I plan to build a way for Spark to get partition function implementations from a source so it can use that function to prepare the other side of a join. From there, I have been thinking about a way to check compatibility between functions, so we could validate that table A has the same bucketing as table B.
Dilip: is bucketing Hive-specific?
Russel: Cassandra also buckets
Matt: what is the difference between bucketing and other partition functions for this?
Ryan: probably no difference. If you’re partitioning by hour, you could probably use that, too.
Dilip: how can Spark compare functions?
Andrew: should we introduce a standard? it would be easy to switch for their use case
Ryan: it is difficult to introduce a standard because so much data already exists in tables. I think it is easier to support multiple functions.
Russel: Cassandra uses dynamic bucketing, which wouldn’t be able to use a standard.
Dilip: sources could push down joins
Russel: that’s a harder problem
Andrew: does anyone else limit bucket size?
Ryan: we don’t because we assume the sort can spill. probably a good optimization for later
Matt: what are the follow-up items for this?
Andrew: will look into the current state of bucketing in Spark
Ryan: it would be great if someone thought about what the FunctionCatalog interface will look like