DataSourceV2 sync, 17 April 2019


Ryan Blue

Here are my notes from the last DSv2 sync. As always:

  • If you’d like to attend the sync, send me an email and I’ll add you to the invite. Everyone is welcome.
  • These notes are what I wrote down and remember. If you have corrections or comments, please reply.

Topics:

Attendees:

Ryan Blue
John Zhuge
Matt Cheah
Yifei Huang
Bruce Robbins
Jamison Bennett
Russell Spitzer
Wenchen Fan
Yuanjian Li

(and others who arrived after the start)

Discussion:

  • TableCatalog PR: https://github.com/apache/spark/pull/24246
    • Wenchen and Matt had just reviewed the PR. It is mostly what was in the SPIP, so there was not much discussion of the content.
    • Wenchen: Easier to review if the changes to move Table and TableCapability were in a separate PR (mostly import changes)
    • Ryan will open a separate PR for the move [Ed: #24410]
    • Russell: How should caching work? He has hit lots of problems with Spark caching data that then gets out of date.
    • Ryan: Spark should always call into the catalog and not cache to avoid those problems. However, Spark should ensure that it uses the same instance of a Table for all scans in the same query, for consistent self-joins.
    • Some discussion of self-joins followed. The conclusion was that we don’t need to worry about this yet because the case is unlikely.
    • Wenchen: should this include the namespace methods?
    • Ryan: No, those are a separate concern and can be added in a parallel PR.
  • Remove SaveMode PR: https://github.com/apache/spark/pull/24233
    • Wenchen: PR is on hold waiting for streaming capabilities, #24129, because the Noop sink doesn’t validate schema
    • Wenchen will open a PR to add a capability to opt out of schema validation, then come back to this PR (see the capability sketch after this list).
  • Streaming capabilities PR: https://github.com/apache/spark/pull/24129
    • Ryan: This PR needs validation in the analyzer. The analyzer is where validations should exist, or else validations must be copied into every code path that produces a streaming plan.
    • Wenchen: the write check can’t be written because the write node is never passed to the analyzer. Fixing that is a larger problem.
    • Ryan: Agree that refactoring to pass the write node to the analyzer should be separate.
    • Wenchen: a check to ensure that either microbatch or continuous can be used is hard because some sources may fall back to v1
    • Ryan: By the time this check runs, fallback has happened. Do v1 sources support continuous mode?
    • Wenchen: No, v1 doesn’t support continuous
    • Ryan: Then this can be written to assume that v1 sources only support microbatch mode.
    • Wenchen will add this check
    • Wenchen: the checks that tables in a v2 streaming relation support either microbatch or continuous won’t catch anything and are unnecessary
    • Ryan: These checks still need to be in the analyzer so future uses do not break. We had the same problem moving to v2: because schema checks were specific to DataSource code paths, they were overlooked when adding v2. Running validations in the analyzer avoids problems like this.
    • Wenchen will add the validation (a rough sketch of these capability checks follows the list).
  • Matt: Will v2 be ready in time for the 3.0 release?
    • Ryan: Once #24246 is in, we can work on PRs in parallel, but it is not looking good.
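
The capability checks referenced above might look roughly like the following. This is a minimal sketch only: the names here (Table, TableCapability, MICRO_BATCH_READ, CONTINUOUS_READ, ACCEPT_ANY_SCHEMA) are simplified stand-ins for the interfaces discussed in #24129 and the SPIP, not the actual Spark classes, and the real validations would live in the analyzer rather than in a standalone object.

// Minimal, self-contained Scala sketch. All types are simplified stand-ins for
// the v2 interfaces under discussion, not the actual Spark classes.
object CapabilityCheckSketch {
  sealed trait TableCapability
  case object MICRO_BATCH_READ  extends TableCapability // supports micro-batch streaming reads
  case object CONTINUOUS_READ   extends TableCapability // supports continuous streaming reads
  case object ACCEPT_ANY_SCHEMA extends TableCapability // sink opts out of schema validation

  // Stand-in for a v2 table exposing its schema and capabilities.
  case class Table(name: String, schema: Seq[String], capabilities: Set[TableCapability])

  // Analyzer-style check: every table read by a streaming query must support at
  // least one streaming mode (v1 fallbacks are assumed to be micro-batch only).
  def checkStreamingRead(tables: Seq[Table]): Unit = {
    val unsupported = tables.filterNot { t =>
      t.capabilities(MICRO_BATCH_READ) || t.capabilities(CONTINUOUS_READ)
    }
    require(unsupported.isEmpty,
      s"Tables do not support streaming reads: ${unsupported.map(_.name).mkString(", ")}")
  }

  // Analyzer-style check: the query schema must match the table schema unless the
  // table declares that it accepts any schema (e.g. a no-op sink).
  def checkWriteSchema(table: Table, querySchema: Seq[String]): Unit = {
    if (!table.capabilities(ACCEPT_ANY_SCHEMA)) {
      require(querySchema == table.schema,
        s"Schema mismatch writing to ${table.name}: expected ${table.schema}, got $querySchema")
    }
  }
}
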
--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 sync, 17 April 2019

Jean Georges Perrin
This may be completely inappropriate, and I apologize if it is; nevertheless, I am trying to get some clarification about the current status of DS.

Please tell me where I am wrong:

Currently, the stable API is v1.
There is a v2 DS API, but it is not widely used.
The group is working on a “new” v2 API that will be available after the release of Spark v3.

jg

--
Jean Georges Perrin

Re: DataSourceV2 sync, 17 April 2019

Ryan Blue
That is mostly correct. V2 standardizes the behavior of logical operations like CTAS across data sources, so it isn't compatible with v1 behavior. Consequently, we can't just move to v2 easily. We have to maintain both in parallel and eventually deprecate v1.
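
To make the CTAS point concrete, here is a rough sketch in Scala of what it means for Spark to own those semantics. The TableCatalog trait and the method signatures below are made-up stand-ins, not the actual Spark v2 interfaces; the point is only that the existence check, create, write, and cleanup are defined once by Spark against a catalog, instead of each v1 source interpreting SaveMode on its own.

// Hedged sketch only: simplified stand-in types, not the actual Spark interfaces.
object CtasSketch {
  trait TableCatalog {
    def tableExists(name: String): Boolean
    def createTable(name: String, columns: Seq[(String, String)]): Unit
    def dropTable(name: String): Unit
  }

  // v2-style CTAS, defined once by Spark: fail if the table exists, create it
  // through the catalog, run the write, and drop the table if the write fails.
  // Under v1, each source decided for itself what SaveMode.ErrorIfExists meant.
  def createTableAsSelect(
      catalog: TableCatalog,
      name: String,
      columns: Seq[(String, String)],
      write: () => Unit): Unit = {
    if (catalog.tableExists(name)) {
      sys.error(s"Table already exists: $name")
    }
    catalog.createTable(name, columns)
    try write() catch {
      case e: Exception =>
        catalog.dropTable(name) // clean up so a failed CTAS is not half-applied
        throw e
    }
  }
}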

We are aiming to have a working v2 in Spark 3.0, but the community has not committed to this goal. Support may be incomplete.

rb

--
Ryan Blue
Software Engineer
Netflix