DSv2 sync - 4 September 2019

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

DSv2 sync - 4 September 2019

Ryan Blue

Here are my notes from the latest sync. Feel free to reply with clarifications if I’ve missed anything.

Attendees:

Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau

Topics:

Discussion:

  • DataFrameWriterV2 insert vs append discussion recapped the agreement from last sync
  • ANSI and strict modes for inserting casts:
    • Russell: Failure modes are important. ANSI behavior is to fail at runtime, not analysis time. If a cast is allowed, but doesn’t throw an exception at runtime then this can’t be considered ANSI behavior.
    • Gengliang: ANSI adds the cast
    • Matt: Sounds like there are two conflicting views of the world. Is the default ANSI behavior to insert a cast that may produce NULL or to fail at runtime?
    • Ryan: So analysis and runtime behaviors can’t be separate?
    • Matt: Analysis behavior is influenced by behavior at runtime. Maybe the vote should cover both?
    • Russell: (linked to the standard) There are 3 steps: if numeric and same type, use the data value. If the value can be rounded or truncated, round or truncate. Otherwise, throw an exception that a value can’t be cast. These are runtime requirements.
    • Ryan: Another consideration is that we can make Spark more permissive, but can’t make Spark more strict in future releases.
    • Matt: v1 silently corrupts data
    • Russell: ANSI is fine, as long as the runtime matches (is ANSI). Don’t tell people it’s ANSI and not do ANSI completely.
    • Gengliang: people are concerned about long-running jobs failing at the end
    • Ryan: That’s okay because they can change the defaults: use strict analysis-time validation, or allow casts to produce NULL values.
    • Matt: As long as this is well documented, it should be fine
    • Ryan: Can we run tests to find out what exactly the behavior is?
    • Gengliang: sqlfiddle.com
    • Russell ran tests in MySQL and Postgres. Both threw runtime failures.
    • Matt: Let’s move on, but add the runtime behavior to the VOTE
  • Identifier resolution and table lookup
    • Ryan: recent changes merged identifier resolution and table lookup together because identifiers owned by the session catalog need to be loaded to find out whether to use v1 or v2 plans. I think this should be separated so that identifier resolution happens independently to ensure that the two separate tasks don’t end up getting done at the same time and over-complicating the analyzer.
  • SHOW NAMESPACES - Ready for final review
  • DataFrameWriterV2:
    • Ryan: Tests failed after passing on the PR. Anyone know why that would happen?
    • Gengliang: tests failed in maven
    • Holden: PR validation runs SBT tests
  • TableProvider API update: skipped because Wenchen didn’t make it
  • UPDATE support PR
    • Ryan: There is a PR to add a SQL UPDATE command, but it delegates entirely to the data source, which seems strange.
    • Matt: What is Spark’s purpose here? Why would Spark parse a SQL statement only to pass it entirely to another engine?
    • Ryan: It does make sense to do this. If Spark eventually supports MERGE INTO and other row-level operations, then it makes sense to push down the operation to some sources, like JDBC. I just find it backward to add the pushdown API before adding an implementation that handles this inside Spark — pushdown is usually an optimization.
    • Russell: Would this be safe? Spark retries lots of operations.
    • Ryan: I think it would be safe because Spark won’t retry top-level operations and this is a single method call. Nothing would get retried.
    • Ryan: I’ll ask what the PR author’s use case is. Maybe that would help clarify why this is a good idea.
--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: DSv2 sync - 4 September 2019

Nicholas Chammas
A quick question about failure modes, as a casual observer of the DSv2 effort:

I was considering filing a JIRA ticket about enhancing the DataFrameReader to include the failure reason in addition to the corrupt record when the mode is PERMISSIVE. So if you are loading a CSV, for example, and a value cannot be automatically cast to the type you specify in the schema, you'll get the corrupt record in the column configured by columnNameOfCorruptRecord, but you'll also get some detail about what exactly made the record corrupt, perhaps in a new column specified by something like columnNameOfCorruptReason.

Is this an enhancement that would be possible in DSv2?

On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <[hidden email]> wrote:

Here are my notes from the latest sync. Feel free to reply with clarifications if I’ve missed anything.

Attendees:

Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau

Topics:

Discussion:

  • DataFrameWriterV2 insert vs append discussion recapped the agreement from last sync
  • ANSI and strict modes for inserting casts:
    • Russell: Failure modes are important. ANSI behavior is to fail at runtime, not analysis time. If a cast is allowed, but doesn’t throw an exception at runtime then this can’t be considered ANSI behavior.
    • Gengliang: ANSI adds the cast
    • Matt: Sounds like there are two conflicting views of the world. Is the default ANSI behavior to insert a cast that may produce NULL or to fail at runtime?
    • Ryan: So analysis and runtime behaviors can’t be separate?
    • Matt: Analysis behavior is influenced by behavior at runtime. Maybe the vote should cover both?
    • Russell: (linked to the standard) There are 3 steps: if numeric and same type, use the data value. If the value can be rounded or truncated, round or truncate. Otherwise, throw an exception that a value can’t be cast. These are runtime requirements.
    • Ryan: Another consideration is that we can make Spark more permissive, but can’t make Spark more strict in future releases.
    • Matt: v1 silently corrupts data
    • Russell: ANSI is fine, as long as the runtime matches (is ANSI). Don’t tell people it’s ANSI and not do ANSI completely.
    • Gengliang: people are concerned about long-running jobs failing at the end
    • Ryan: That’s okay because they can change the defaults: use strict analysis-time validation, or allow casts to produce NULL values.
    • Matt: As long as this is well documented, it should be fine
    • Ryan: Can we run tests to find out what exactly the behavior is?
    • Gengliang: sqlfiddle.com
    • Russell ran tests in MySQL and Postgres. Both threw runtime failures.
    • Matt: Let’s move on, but add the runtime behavior to the VOTE
  • Identifier resolution and table lookup
    • Ryan: recent changes merged identifier resolution and table lookup together because identifiers owned by the session catalog need to be loaded to find out whether to use v1 or v2 plans. I think this should be separated so that identifier resolution happens independently to ensure that the two separate tasks don’t end up getting done at the same time and over-complicating the analyzer.
  • SHOW NAMESPACES - Ready for final review
  • DataFrameWriterV2:
    • Ryan: Tests failed after passing on the PR. Anyone know why that would happen?
    • Gengliang: tests failed in maven
    • Holden: PR validation runs SBT tests
  • TableProvider API update: skipped because Wenchen didn’t make it
  • UPDATE support PR
    • Ryan: There is a PR to add a SQL UPDATE command, but it delegates entirely to the data source, which seems strange.
    • Matt: What is Spark’s purpose here? Why would Spark parse a SQL statement only to pass it entirely to another engine?
    • Ryan: It does make sense to do this. If Spark eventually supports MERGE INTO and other row-level operations, then it makes sense to push down the operation to some sources, like JDBC. I just find it backward to add the pushdown API before adding an implementation that handles this inside Spark — pushdown is usually an optimization.
    • Russell: Would this be safe? Spark retries lots of operations.
    • Ryan: I think it would be safe because Spark won’t retry top-level operations and this is a single method call. Nothing would get retried.
    • Ryan: I’ll ask what the PR author’s use case is. Maybe that would help clarify why this is a good idea.
--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: DSv2 sync - 4 September 2019

cloud0fan
Hi Nicholas,

You are talking about a different thing. The PERMISSIVE mode is the failure mode for reading text-based data source (json, csv, etc.). It's not the general failure mode for Spark table insertion.

I agree with you that the PERMISSIVE mode is hard to use. Feel free to open a JIRA ticket if you have some better ideas.

Thanks,
Wenchen

On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <[hidden email]> wrote:
A quick question about failure modes, as a casual observer of the DSv2 effort:

I was considering filing a JIRA ticket about enhancing the DataFrameReader to include the failure reason in addition to the corrupt record when the mode is PERMISSIVE. So if you are loading a CSV, for example, and a value cannot be automatically cast to the type you specify in the schema, you'll get the corrupt record in the column configured by columnNameOfCorruptRecord, but you'll also get some detail about what exactly made the record corrupt, perhaps in a new column specified by something like columnNameOfCorruptReason.

Is this an enhancement that would be possible in DSv2?

On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <[hidden email]> wrote:

Here are my notes from the latest sync. Feel free to reply with clarifications if I’ve missed anything.

Attendees:

Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau

Topics:

Discussion:

  • DataFrameWriterV2 insert vs append discussion recapped the agreement from last sync
  • ANSI and strict modes for inserting casts:
    • Russell: Failure modes are important. ANSI behavior is to fail at runtime, not analysis time. If a cast is allowed, but doesn’t throw an exception at runtime then this can’t be considered ANSI behavior.
    • Gengliang: ANSI adds the cast
    • Matt: Sounds like there are two conflicting views of the world. Is the default ANSI behavior to insert a cast that may produce NULL or to fail at runtime?
    • Ryan: So analysis and runtime behaviors can’t be separate?
    • Matt: Analysis behavior is influenced by behavior at runtime. Maybe the vote should cover both?
    • Russell: (linked to the standard) There are 3 steps: if numeric and same type, use the data value. If the value can be rounded or truncated, round or truncate. Otherwise, throw an exception that a value can’t be cast. These are runtime requirements.
    • Ryan: Another consideration is that we can make Spark more permissive, but can’t make Spark more strict in future releases.
    • Matt: v1 silently corrupts data
    • Russell: ANSI is fine, as long as the runtime matches (is ANSI). Don’t tell people it’s ANSI and not do ANSI completely.
    • Gengliang: people are concerned about long-running jobs failing at the end
    • Ryan: That’s okay because they can change the defaults: use strict analysis-time validation, or allow casts to produce NULL values.
    • Matt: As long as this is well documented, it should be fine
    • Ryan: Can we run tests to find out what exactly the behavior is?
    • Gengliang: sqlfiddle.com
    • Russell ran tests in MySQL and Postgres. Both threw runtime failures.
    • Matt: Let’s move on, but add the runtime behavior to the VOTE
  • Identifier resolution and table lookup
    • Ryan: recent changes merged identifier resolution and table lookup together because identifiers owned by the session catalog need to be loaded to find out whether to use v1 or v2 plans. I think this should be separated so that identifier resolution happens independently to ensure that the two separate tasks don’t end up getting done at the same time and over-complicating the analyzer.
  • SHOW NAMESPACES - Ready for final review
  • DataFrameWriterV2:
    • Ryan: Tests failed after passing on the PR. Anyone know why that would happen?
    • Gengliang: tests failed in maven
    • Holden: PR validation runs SBT tests
  • TableProvider API update: skipped because Wenchen didn’t make it
  • UPDATE support PR
    • Ryan: There is a PR to add a SQL UPDATE command, but it delegates entirely to the data source, which seems strange.
    • Matt: What is Spark’s purpose here? Why would Spark parse a SQL statement only to pass it entirely to another engine?
    • Ryan: It does make sense to do this. If Spark eventually supports MERGE INTO and other row-level operations, then it makes sense to push down the operation to some sources, like JDBC. I just find it backward to add the pushdown API before adding an implementation that handles this inside Spark — pushdown is usually an optimization.
    • Russell: Would this be safe? Spark retries lots of operations.
    • Ryan: I think it would be safe because Spark won’t retry top-level operations and this is a single method call. Nothing would get retried.
    • Ryan: I’ll ask what the PR author’s use case is. Maybe that would help clarify why this is a good idea.
--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: DSv2 sync - 4 September 2019

Nicholas Chammas
Ah yes, on rereading the original email I see that the sync discussion was different. Thanks for the clarification! I’ll file a JIRA about PERMISSIVE.

2019년 9월 9일 (월) 오전 6:05, Wenchen Fan <[hidden email]>님이 작성:
Hi Nicholas,

You are talking about a different thing. The PERMISSIVE mode is the failure mode for reading text-based data source (json, csv, etc.). It's not the general failure mode for Spark table insertion.

I agree with you that the PERMISSIVE mode is hard to use. Feel free to open a JIRA ticket if you have some better ideas.

Thanks,
Wenchen

On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <[hidden email]> wrote:
A quick question about failure modes, as a casual observer of the DSv2 effort:

I was considering filing a JIRA ticket about enhancing the DataFrameReader to include the failure reason in addition to the corrupt record when the mode is PERMISSIVE. So if you are loading a CSV, for example, and a value cannot be automatically cast to the type you specify in the schema, you'll get the corrupt record in the column configured by columnNameOfCorruptRecord, but you'll also get some detail about what exactly made the record corrupt, perhaps in a new column specified by something like columnNameOfCorruptReason.

Is this an enhancement that would be possible in DSv2?

On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <[hidden email]> wrote:

Here are my notes from the latest sync. Feel free to reply with clarifications if I’ve missed anything.

Attendees:

Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau

Topics:

Discussion:

  • DataFrameWriterV2 insert vs append discussion recapped the agreement from last sync
  • ANSI and strict modes for inserting casts:
    • Russell: Failure modes are important. ANSI behavior is to fail at runtime, not analysis time. If a cast is allowed, but doesn’t throw an exception at runtime then this can’t be considered ANSI behavior.
    • Gengliang: ANSI adds the cast
    • Matt: Sounds like there are two conflicting views of the world. Is the default ANSI behavior to insert a cast that may produce NULL or to fail at runtime?
    • Ryan: So analysis and runtime behaviors can’t be separate?
    • Matt: Analysis behavior is influenced by behavior at runtime. Maybe the vote should cover both?
    • Russell: (linked to the standard) There are 3 steps: if numeric and same type, use the data value. If the value can be rounded or truncated, round or truncate. Otherwise, throw an exception that a value can’t be cast. These are runtime requirements.
    • Ryan: Another consideration is that we can make Spark more permissive, but can’t make Spark more strict in future releases.
    • Matt: v1 silently corrupts data
    • Russell: ANSI is fine, as long as the runtime matches (is ANSI). Don’t tell people it’s ANSI and not do ANSI completely.
    • Gengliang: people are concerned about long-running jobs failing at the end
    • Ryan: That’s okay because they can change the defaults: use strict analysis-time validation, or allow casts to produce NULL values.
    • Matt: As long as this is well documented, it should be fine
    • Ryan: Can we run tests to find out what exactly the behavior is?
    • Gengliang: sqlfiddle.com
    • Russell ran tests in MySQL and Postgres. Both threw runtime failures.
    • Matt: Let’s move on, but add the runtime behavior to the VOTE
  • Identifier resolution and table lookup
    • Ryan: recent changes merged identifier resolution and table lookup together because identifiers owned by the session catalog need to be loaded to find out whether to use v1 or v2 plans. I think this should be separated so that identifier resolution happens independently to ensure that the two separate tasks don’t end up getting done at the same time and over-complicating the analyzer.
  • SHOW NAMESPACES - Ready for final review
  • DataFrameWriterV2:
    • Ryan: Tests failed after passing on the PR. Anyone know why that would happen?
    • Gengliang: tests failed in maven
    • Holden: PR validation runs SBT tests
  • TableProvider API update: skipped because Wenchen didn’t make it
  • UPDATE support PR
    • Ryan: There is a PR to add a SQL UPDATE command, but it delegates entirely to the data source, which seems strange.
    • Matt: What is Spark’s purpose here? Why would Spark parse a SQL statement only to pass it entirely to another engine?
    • Ryan: It does make sense to do this. If Spark eventually supports MERGE INTO and other row-level operations, then it makes sense to push down the operation to some sources, like JDBC. I just find it backward to add the pushdown API before adding an implementation that handles this inside Spark — pushdown is usually an optimization.
    • Russell: Would this be safe? Spark retries lots of operations.
    • Ryan: I think it would be safe because Spark won’t retry top-level operations and this is a single method call. Nothing would get retried.
    • Ryan: I’ll ask what the PR author’s use case is. Maybe that would help clarify why this is a good idea.
--
Ryan Blue
Software Engineer
Netflix