DataSourceV2 sync notes (#4)


Ryan Blue

Hi everyone, sorry these notes are late. I didn’t have the time to write this up last week.

For anyone interested in the next sync, we decided to skip next week and resume in early January. I’ve already sent the invite. As usual, if you have topics you’d like to discuss or would like to be added to the invite list, just let me know. Everyone is welcome.

rb

Attendees:
Ryan Blue
Xiao Li
Bruce Robbins
John Zhuge
Anton Okolnychyi
Jackey Lee
Jamison Bennett
Srabasti Banerjee
Thomas D’Silva
Wenchen Fan
Matt Cheah
Maryann Xue
(possibly others that entered after the start)

Agenda:

  • Current discussions from the v2 batch write PR: WriteBuilder and SaveMode
  • Continue sql-api discussion after looking at API dependencies
  • Capabilities API
  • Overview of TableCatalog proposal to sync understanding (if time)

Notes:

  • WriteBuilder:
    • Wenchen summarized the options (factory methods vs builder) and some trade-offs
    • What we need to accomplish now can be done with factory methods, which are simpler
    • A builder matches the structure of the read side
    • Ryan’s opinion is to use the builder for consistency and evolution. Builder makes it easier to change or remove parts without copying all of the args of a method.
    • Matt’s opinion is that a builder makes evolution and maintenance easier, and that matching the read side is good
    • Consensus was to use WriteBuilder instead of factory methods
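
For illustration, the builder shape that was agreed on might look roughly like the Java sketch below. The names are hypothetical (the actual methods were still under discussion in the PR), so this is a sketch of the pattern, not the real Spark interfaces:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical result of a build; illustrative name only.
interface BatchWrite {
  String description();
}

// Builder style: configuration accumulates one call at a time, so adding a
// new option later means adding one "with..." method instead of copying the
// full argument list of every factory method.
class WriteBuilder {
  private List<String> columns = new ArrayList<>();
  private final Map<String, String> options = new LinkedHashMap<>();

  WriteBuilder withQuerySchema(List<String> cols) {
    this.columns = cols;
    return this;
  }

  WriteBuilder withOption(String key, String value) {
    options.put(key, value);
    return this;
  }

  BatchWrite buildForBatch() {
    String desc = "batch write of " + columns + " with options " + options;
    return () -> desc;
  }
}
```

Compared with a factory method such as a hypothetical newAppendWrite(columns, options), a new write option here is one new method and existing callers keep compiling, which is the evolution argument made above.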
  • SaveMode:
    • Context: v1 passes SaveMode from the DataFrameWriter API to sources. The action taken for a given mode and existing table state depends on the source implementation, which is something the community wants to fix in v2. But v2 initially passed SaveMode to sources, so the question is how and when to remove it.
    • Wenchen: the current API uses SaveMode and we don’t want to drop features
    • Ryan: the main requirement is removing this before the next release. We should not ship a substantial API change without removing it, because removing it later would force yet another API change.
    • Xiao: suggested creating a release-blocking issue.
    • Consensus was to remove SaveMode before the next release, blocking if needed.
    • Someone also stated that keeping SaveMode would make porting file sources to v2 easier
    • Ryan disagrees that using SaveMode makes porting file sources faster or easier.
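
To make the inconsistency concrete, here is a small hypothetical sketch of how two v1-style sources can legally interpret the same SaveMode differently; the class names and behaviors are illustrative, not Spark's actual internals:

```java
// Sketch of the v1 problem: the action taken for a given SaveMode and table
// state is up to each source, so the same call can behave differently.
enum SaveMode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

interface V1Source {
  // Each implementation is free to interpret the mode however it likes.
  String write(SaveMode mode, boolean tableExists);
}

class TruncatingSource implements V1Source {
  public String write(SaveMode mode, boolean tableExists) {
    if (mode == SaveMode.OVERWRITE) return "truncate table, then append";
    if (mode == SaveMode.ERROR_IF_EXISTS && tableExists) return "fail";
    return "append";
  }
}

class DroppingSource implements V1Source {
  public String write(SaveMode mode, boolean tableExists) {
    // Same mode, different behavior: this source drops and recreates.
    if (mode == SaveMode.OVERWRITE) return "drop table, recreate, then append";
    return "append";
  }
}
```

Both implementations are "correct" under v1, which is why the community wants v2 to define the behavior itself rather than delegate it to sources.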
  • Capabilities API (this is a quick overview of a long conversation)
    • Context: there are several situations where a source needs to change how Spark behaves, or where Spark needs to check whether a source supports some feature. For example: Spark checks whether a source supports batch writes; write-only sources that do not need validation need to tell Spark not to run validation rules; and sources that can read files with missing columns (e.g., Iceberg) need Spark to allow writes that omit columns, provided those columns are optional or have default values.
    • Xiao suggested handling this case by case and the conversation moved to discussing the motivating case for Netflix: allowing writes that do not include optional columns.
    • Wenchen and Maryann added that Spark should handle all default values so that this doesn’t differ across sources. Ryan agreed that would be good, but pointed out challenges.
    • There was a long discussion about how Spark could handle default values. The difficulty is that adding a column with a default value raises the question of how older data, written before the column existed, is read. Maryann and Dilip pointed out that traditional databases handle default values at write time, so the correct default is the one in effect at write time (instead of read time), but it is unclear how existing data is handled.
    • Matt and Ryan asked whether databases update existing rows when a default is added. But even if a database can update all existing rows, that would not be reasonable for Spark, which in the worst case would need to update millions of immutable files. This is also not a reasonable requirement to put on sources, so Spark would need to have read-side defaults.
    • Xiao noted that it may be easier to treat internal and external sources differently, so that internal sources handle defaults. Ryan pointed out that this is the motivation for adding a capability API.
    • Consensus was to start a discuss thread on the dev list about default values.
    • Discussion shifted to a different example: the need to disable validation for write-only tables. Consensus was that this use case is valid.
    • Wenchen: capabilities would work to disable write validation, but should not be string based.
    • Consensus was to use a capabilities API, but use an enum instead of strings.
    • Open question: what other options should use a capabilities API?
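
An enum-based capabilities API along the lines of that consensus might be sketched as follows; the capability names and the helper method are assumptions for illustration, not the final API:

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative enum-based capabilities, per the consensus above to avoid
// string-based capability names. The names here are hypothetical.
enum TableCapability {
  BATCH_WRITE,            // source supports batch writes
  ACCEPT_MISSING_COLUMNS, // source can fill optional/defaulted columns itself
  SKIP_WRITE_VALIDATION   // write-only source: skip Spark's validation rules
}

interface Table {
  Set<TableCapability> capabilities();
}

class WriteOnlyTable implements Table {
  public Set<TableCapability> capabilities() {
    return EnumSet.of(TableCapability.BATCH_WRITE,
                      TableCapability.SKIP_WRITE_VALIDATION);
  }
}

// Spark-side check before applying write validation rules.
class Checks {
  static boolean shouldValidate(Table t) {
    return !t.capabilities().contains(TableCapability.SKIP_WRITE_VALIDATION);
  }
}
```

The enum gives a compiler-checked, discoverable set of capabilities, unlike strings, while still letting Spark probe a source with a simple set-membership test.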
--
Ryan Blue
Software Engineer
Netflix
Re: DataSourceV2 sync notes (#4)

Srabasti Banerjee
Thanks for sending out the meeting notes from last week's discussion Ryan!

For unknown technical reasons, I could not unmute myself to be heard when I was trying to pitch in during the discussion of default value handling in traditional databases, so I posted my response in the chat.

My 2 cents on how traditional databases handle default values: from my industry experience, Oracle has a constraint clause, ENABLE NOVALIDATE, that causes new rows added going forward to get the default value. Older existing rows/data are not required to be updated with the default value, though one can choose to do a data fix at any point.

Happy Holidays All in advance :-)

Warm Regards,
Srabasti Banerjee

On Tuesday, 18 December, 2018, 4:15:06 PM GMT-8, Ryan Blue <[hidden email]> wrote: