Correctness and data loss issues


Dongjoon Hyun-2
Hi, All.

According to our policy, "Correctness and data loss issues should be considered Blockers".

    - http://spark.apache.org/contributing.html

Since we are close to the branch-3.0 cut,
I want to ask for your opinions on the following correctness and data loss issues.

    SPARK-30218 Columns used in inequality conditions for joins not resolved correctly in case of common lineage
    SPARK-29701 Different answers when empty input given in GROUPING SETS
    SPARK-29699 Different answers in nested aggregates with window functions
    SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
    SPARK-28125 dataframes created by randomSplit have overlapping rows
    SPARK-28067 Incorrect results in decimal aggregation with whole-stage code gen enabled
    SPARK-28024 Incorrect numeric values when out of range
    SPARK-27784 Alias ID reuse can break correctness when substituting foldable expressions
    SPARK-27619 MapType should be prohibited in hash expressions
    SPARK-27298 Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
    SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
    SPARK-27213 Unexpected results when filter is used after distinct
    SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
    SPARK-25150 Joining DataFrames derived from the same source yields confusing/incorrect results
    SPARK-21774 The rule PromoteStrings cast string to a wrong data type
    SPARK-19248 Regex_replace works in 1.6 but not in 2.0

Some of them are targeted at 3.0.0, but the others are not.
Although we will work on them until 3.0.0,
I'm not sure we can reach a state with no known correctness or data loss issues.

What do you think about the above issues?

Bests,
Dongjoon.

Re: Correctness and data loss issues

cloud0fan
I think we need to go through them during the 3.0 QA period, and try to fix the valid ones.

For example, the first ticket should already be fixed by https://issues.apache.org/jira/browse/SPARK-28344


Re: Correctness and data loss issues

Dongjoon Hyun-2
Thank you for checking, Wenchen! Sure, we need to do that.

Another question is: what can we do for the 2.4.5 release?
Some of the fixes cannot be backported due to technical difficulties, like the following.

    1. https://issues.apache.org/jira/browse/SPARK-26154
        Stream-stream joins - left outer join gives inconsistent output
        (Like this one, there are eight correctness fixes that land only in 3.0.0.)

    2. https://github.com/apache/spark/pull/27233
        [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
        (This is an ongoing PR which is currently blocking 2.4.5 RC2.)

Bests,
Dongjoon.


Re: Correctness and data loss issues

Tom Graves-2
I agree, I think we just need to go through all of them and individually assess each one. If it's really a correctness issue we should hold 3.0 for it.

On the 2.4 release, I didn't see an explanation on https://issues.apache.org/jira/browse/SPARK-26154 of why it can't be backported; at the very least, I think we need that in each JIRA's comments.

SPARK-29701 looks more like compatibility with Postgres than a purely wrong answer to me. If Spark has been consistent about that, it feels like it can wait for 3.0, but it would be good to get others' input; I'm not an expert on the SQL standard or on what other SQL engines do in this case.

Tom


Re: Correctness and data loss issues

Dongjoon Hyun-2
Hi, Tom.

Then, given the following, do you think we need to hold the 2.4.5 release, too?

> If it's really a correctness issue we should hold 3.0 for it.

Recently,

    (1) 2.4.4 delivered 9 correctness patches.
    (2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches, too.

        SPARK-29101 CSV datasource returns incorrect .count() from file with malformed records
        SPARK-30447 Constant propagation nullability issue
        SPARK-29708 Different answers in aggregates of duplicate grouping sets
        SPARK-29651 Incorrect parsing of interval seconds fraction
        SPARK-29918 RecordBinaryComparator should check endianness when compared by long
        SPARK-29042 Sampling-based RDD with unordered input should be INDETERMINATE
        SPARK-30082 Zeros are being treated as NaNs
        SPARK-29743 sample should set needCopyResult to true if its child is
        SPARK-26985 Test "access only some column of the all of columns " fails on big endian

Without official Apache Spark 2.4.5 binaries,
there is no official way to deliver the 9 correctness fixes in (2) to users.
In addition, the correctness fixes are usually independent of each other.

Bests,
Dongjoon.



Re: Correctness and data loss issues

Dongjoon Hyun-2
Hi, All.

BTW, based on the feedback so far,
I updated all open `correctness` and `dataloss` issues as follows:

    1. Raised the issue priority to `Blocker`.
    2. Set the target version to `3.0.0`.

It's time to give those issues more visibility in order to close or resolve them.

The remaining questions are the following:

    1. Revisit `3.0.0`-only correctness patches?
    2. Set the target version to `2.4.5`? (Specifically, is this feasible in terms of timeline?)

Bests,
Dongjoon.



Re: Correctness and data loss issues

Tom Graves-2
Here are my thoughts on your list; it would be good to get input from the people who worked on these issues. Obviously, we can weigh the importance of these against getting 2.4.5 out, which has a bunch of other correctness fixes you mention as well. I think you have already pinged most of the JIRAs for feedback.


 SPARK-30218 Columns used in inequality conditions for joins not resolved correctly in case of common lineage
You already linked this to SPARK-28344 and asked the question about a backport.
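
For reference, the self-join shape the ticket describes is roughly the
following (an untested sketch, assuming an active SparkSession `spark`;
the authoritative reproduction is in the JIRA):

    import spark.implicits._

    val df1 = Seq((1, 10), (2, 20)).toDF("k", "v")
    val df2 = df1.filter($"v" > 5)
    // df1("v") and df2("v") carry the same attribute ID because both
    // frames share lineage, so the inequality below can get resolved
    // against the wrong side of the join.
    val joined = df1.join(df2, df1("k") === df2("k") && df1("v") < df2("v"))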

    SPARK-29701 Different answers when empty input given in GROUPING SETS
This seems like a Postgres compatibility thing again, not a correctness issue.
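
Concretely, the question is what a query of this shape returns over empty
input (an untested sketch; PostgreSQL returns a single row with count 0
for the empty grouping set, and the ticket asks whether Spark should match):

    spark.sql("""
      SELECT COUNT(*)
      FROM (SELECT 1 AS id) t
      WHERE 1 = 0
      GROUP BY GROUPING SETS (())
    """).show()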

    SPARK-29699 Different answers in nested aggregates with window functions
This seems like a Postgres compatibility thing again, not a correctness issue.

    SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe 
This is currently listed as an improvement, and I can see the argument that the user has to explicitly do this across separate threads, so it seems less critical to me, though definitely nice to fix. I personally think it's OK not to have it in 2.4.5.
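
For reference, the pattern in question is roughly concurrent Dataset
creation from local Seqs, as in this untested sketch (assuming an active
SparkSession `spark`):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global
    import spark.implicits._

    // Build small Datasets from local Seqs on several threads at once;
    // the ticket reports that the results can be corrupted.
    val futures = (1 to 8).map { i =>
      Future(Seq(i, i * 10, i * 100).toDS().collect().toSeq)
    }
    futures.foreach(f => println(Await.result(f, 1.minute)))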

    SPARK-28125 dataframes created by randomSplit have overlapping rows
Seems like something we should fix.
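
For what it's worth, the overlap is easy to check with something like this
untested sketch (assuming an active SparkSession `spark`; the overlap
reportedly shows up when the input has no deterministic row order, e.g.
after a repartition):

    val df = spark.range(0, 1000).toDF("id").repartition(8)
    val Array(a, b) = df.randomSplit(Array(0.7, 0.3))
    // Expected 0; the ticket reports overlapping rows here.
    println(a.intersect(b).count())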

    SPARK-28067 Incorrect results in decimal aggregation with whole-stage code gen enabled
Seems like we should fix this.
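
An untested sketch in the spirit of the ticket (assuming an active
SparkSession `spark`): a decimal sum whose intermediate result overflows
DECIMAL(38,0):

    import org.apache.spark.sql.functions.sum

    val df = spark.range(0, 12).selectExpr(
      "CAST('99999999999999999999999999999999999999' AS DECIMAL(38,0)) AS d")
    // The ticket reports a wrong value here with whole-stage codegen
    // enabled, instead of a NULL overflow indication.
    df.agg(sum("d")).show(false)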

    SPARK-28024 Incorrect numeric values when out of range
Seems like we could skip this for 2.4.5; some overflow exceptions were fixed in 3.0.
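
For example (an untested sketch; the JIRA has the authoritative cases), I
believe an out-of-range cast like this silently wraps in 2.x rather than
erroring:

    // 2147483648 is one past Int.MaxValue; without overflow checks the
    // cast just truncates the bits.
    spark.sql("SELECT CAST(2147483648 AS INT)").show()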

    SPARK-27784 Alias ID reuse can break correctness when substituting foldable expressions
It would be good to understand what fixed this in 3.0 to see if we can backport it.

    SPARK-27619 MapType should be prohibited in hash expressions
Seems behavioral to me, and it's been consistent, so it seems OK to skip for 2.4.5.
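
For context, this is about expressions like the following untested sketch;
map element order is undefined, so two equal maps can hash differently,
which is why prohibiting the expression was proposed:

    // hash() over a map value; equal maps with different internal
    // element order may not produce the same hash.
    spark.sql("SELECT hash(map(1, 'a', 2, 'b'))").show()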

    SPARK-27298 Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
Seems to be a Windows-vs-Linux issue, and it seems like we should investigate.

    SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
Similarly, this seems to be fixed in Spark 3.0, so we need to see if we can backport it once we find what fixed it.

    SPARK-27213 Unexpected results when filter is used after distinct
We need to try to reproduce this on 2.4.x.
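
The general shape to try, as an untested sketch (the exact query is in the
JIRA), is a filter evaluated on the result of distinct():

    import spark.implicits._

    val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("k", "v")
    df.distinct().filter($"v" === 1).show()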

    SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
Seems like we should investigate further for a 2.4.x fix.

    SPARK-25150 Joining DataFrames derived from the same source yields confusing/incorrect results
Seems like we should investigate further for a 2.4.x fix.

    SPARK-21774 The rule PromoteStrings cast string to a wrong data type
Seems like we should investigate further for a 2.4.x fix.

    SPARK-19248 Regex_replace works in 1.6 but not in 2.0
Seems wrong, but if it's been consistent for the entire 2.x line, it may be OK to skip for 2.4.x.
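
If I remember right, this comes down to the change in SQL string-literal
escaping between 1.6 and 2.0; an untested sketch of an affected call (I
believe spark.sql.parser.escapedStringLiterals was later added to restore
the old behavior):

    // In 2.x the SQL parser unescapes '\\.' to the two-character string
    // \. before it reaches the regex engine.
    spark.sql("""SELECT regexp_replace('a.b.c', '\\.', '-')""").show()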

Tom