[DISCUSS] Disable streaming query with possible correctness issue by default

Liang-Chi Hsieh
Hi devs,

In Spark Structured Streaming, chained stateful operators can produce
incorrect results under the global watermark. SPARK-33259
(https://issues.apache.org/jira/browse/SPARK-33259) has an example
demonstrating what the correctness issue can look like.
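
To make the failure mode concrete, here is a toy model in plain Python. It is
not Spark code and not the SPARK-33259 reproduction. It assumes a windowed
count in append mode (op1) feeding a second stateful operator (op2), one
global watermark shared by both, and a rule that a row is late when its event
time is at or below the watermark. The window size, delay, and all names are
illustrative.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window length, in seconds (illustrative value)
DELAY = 5    # allowed lateness, in seconds (illustrative value)


def run_chained(batches):
    """Toy model of op1 (windowed count, append mode) feeding op2 (another
    stateful operator), with both sharing a single global watermark."""
    watermark = 0
    max_seen = 0
    op1_state = defaultdict(int)  # window end -> running count
    emitted, accepted, dropped = [], [], []

    for batch in batches:
        # op1: skip input that is already late, then update window counts.
        for event_time in batch:
            if event_time <= watermark:
                continue
            window_end = (event_time // WINDOW + 1) * WINDOW
            op1_state[window_end] += 1
            max_seen = max(max_seen, event_time)

        # op1 emits a window only once the watermark has passed its end.
        for window_end in sorted(e for e in op1_state if e <= watermark):
            row = (window_end, op1_state.pop(window_end))
            emitted.append(row)
            # op2 applies the same watermark to op1's output. The row's event
            # time is at or below the watermark by construction, so it is late.
            if window_end <= watermark:
                dropped.append(row)
            else:
                accepted.append(row)

        # Advance the watermark for the next micro-batch.
        watermark = max(watermark, max_seen - DELAY)

    return emitted, accepted, dropped


emitted, accepted, dropped = run_chained([[1, 4, 9], [12, 17], [21, 26], []])
print("op1 emitted:", emitted)
print("op2 accepted:", accepted)
print("op2 dropped:", dropped)
```

In this model every row op1 emits is discarded by op2, and nothing in the
output tells the user that rows went missing.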

Currently we don't prevent users from running such queries. Because the
possible correctness issue in chained stateful operators is not obvious to
users, from their perspective it will likely look like a Spark bug, as in
SPARK-33259. In the worse case, users are not aware of the correctness
issue and use wrong results.

IMO, it is better to disable such queries and let users choose to run the
query if they understand the risk, instead of implicitly running the query
and leaving users to find the correctness issue by themselves.

I would like to propose disabling, by default, streaming queries with a
possible correctness issue in chained stateful operators. The behavior can
be controlled by a SQL config, so if users understand the risk and still
want to run the query, they can disable the check.
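
The proposed behavior can be sketched in plain Python as follows. This is a
simplified illustration, not the logic in the PR: the operator names, the
`check_enabled` flag, and the exception type are all hypothetical, and the
real check may be more selective about which operator combinations it flags.
The thread does not name the config; the name that appears to have shipped in
Spark 3.1 is `spark.sql.streaming.statefulOperator.checkCorrectness.enabled`,
but verify that against the migration guide before relying on it.

```python
class CorrectnessCheckError(Exception):
    """Raised when the check is on and the plan chains stateful operators."""


# Operator kinds this sketch treats as stateful (illustrative, not Spark's list).
STATEFUL_KINDS = {"aggregate", "stream_stream_join", "flat_map_groups_with_state"}


def check_chained_stateful(operators, check_enabled=True):
    """Inspect a linear plan (source first). Returns None for a safe plan, or a
    warning string when the check is disabled; raises when it is enabled."""
    stateful = [op for op in operators if op in STATEFUL_KINDS]
    if len(stateful) < 2:
        return None
    message = (
        "possible correctness issue under the global watermark: "
        + " -> ".join(stateful)
    )
    if check_enabled:
        raise CorrectnessCheckError(message)
    return "WARN: " + message


plan = ["source", "stream_stream_join", "aggregate", "sink"]

try:
    check_chained_stateful(plan)  # default: the query is rejected
    rejected = False
except CorrectnessCheckError:
    rejected = True

# A user who accepts the risk turns the check off and gets a warning instead.
warning = check_chained_stateful(plan, check_enabled=False)
print(rejected, warning)
```

The point of the design is that the unsafe path requires an explicit opt-in,
while queries with at most one stateful operator are unaffected.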

In the PR (https://github.com/apache/spark/pull/30210), the concern raised
so far is that this changes current behavior and will by default break some
existing streaming queries. But I think it is pretty easy to disable the
check with the new config. There is currently no objection in the PR, only
a suggestion to hear more voices. Please let me know if you have any
thoughts.

Thanks.
Liang-Chi Hsieh



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: [DISCUSS] Disable streaming query with possible correctness issue by default

Jungtaek Lim-2
After the check logic was introduced in Spark 3.0, I addressed another related issue in Spark 3.1, SPARK-24634 [1].

Before SPARK-24634, there was no way to know how many rows were discarded for being late, or even whether there were any late rows at all. In other words, the correctness issue had been "silently" affecting results. SPARK-24634 provides the overall number of late rows in the streaming listener, as well as the number of late rows "per operator" in the SQL UI graph, so end users are no longer "blindly" impacted.
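
As a rough illustration of the two numbers described above (a per-operator
count and an overall count), here is a minimal sketch in plain Python. It does
not use Spark's listener or metric API; the function name, the operator names,
and the "late when event time <= watermark" rule are assumptions made for the
example.

```python
def count_late_rows(rows_by_operator, watermark):
    """Return (per-operator late-row counts, overall total) for one batch.

    rows_by_operator maps an operator name to the event times of its input
    rows; a row is counted as late when its event time is <= watermark."""
    per_operator = {
        operator: sum(1 for t in event_times if t <= watermark)
        for operator, event_times in rows_by_operator.items()
    }
    return per_operator, sum(per_operator.values())


per_operator, total = count_late_rows(
    {"join": [8, 15, 22], "aggregate": [10, 12, 30]},
    watermark=12,
)
print(per_operator, total)
```

A non-zero count on a downstream operator is the signal that an upstream
operator's output is being discarded.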

Even so, I agree that it's pretty hard to construct a query that avoids correctness issues and still does chained stateful operations. I see two separate JIRA issues reporting the same correctness behavior, meaning this is already impacting end users' queries. (Many more end users may not even have noticed the impact, as SPARK-24634 isn't released yet.)

So overall I'm +1 on preventing such queries up front. This change may break some user queries, but I suspect those queries already suffer from correctness issues that the users haven't noticed.

For sure, a better approach would be to drop the global watermark and implement operator-wise watermarks properly. This is just a workaround, but fixing the watermark would require major effort.
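
One possible reading of "operator-wise watermark" is sketched below in plain
Python. It is an assumption made for illustration, not a design from this
thread or from Spark: each operator advances its own watermark from the rows
it actually receives, instead of using a watermark driven by the raw source.
The class name, the batch contents, and the global watermark values are
hypothetical.

```python
class StatefulOperator:
    """Toy operator that tracks a watermark from its own input only."""

    def __init__(self, delay):
        self.delay = delay
        self.watermark = 0
        self.max_seen = 0

    def accept(self, event_time):
        """Return True if the row is on time for this operator."""
        if event_time <= self.watermark:
            return False
        self.max_seen = max(self.max_seen, event_time)
        return True

    def end_of_batch(self):
        """Advance this operator's watermark from the rows it has seen."""
        self.watermark = max(self.watermark, self.max_seen - self.delay)


# Rows an upstream windowed aggregation emits per batch: (window end, count).
upstream_output = [[(10, 3)], [(20, 2)]]
# Global watermark in effect when each batch runs (driven by the raw source).
global_watermarks = [12, 21]

downstream = StatefulOperator(delay=0)
kept_global, kept_per_operator = [], []
for rows, global_wm in zip(upstream_output, global_watermarks):
    for row in rows:
        if row[0] > global_wm:            # global watermark rule
            kept_global.append(row)
        if downstream.accept(row[0]):     # operator-wise rule
            kept_per_operator.append(row)
    downstream.end_of_batch()

print(kept_global, kept_per_operator)
```

Under the global rule the downstream operator keeps nothing, while under the
per-operator rule it keeps both rows.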

Thanks,
Jungtaek Lim (HeartSaVioR)



In reply to Liang-Chi Hsieh's post of Sat, Nov 7, 2020 at 3:59 PM.

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

Tom Graves-2
In reply to this post by Liang-Chi Hsieh
+1. Since it's a correctness issue, I think it's OK to change the behavior to make sure the user is aware of it and let them decide.

Tom


Re: [DISCUSS] Disable streaming query with possible correctness issue by default

Dongjoon Hyun-2
+1 for Apache Spark 3.1.0.

Bests,
Dongjoon.

In reply to Tom Graves's post of Tue, Nov 10, 2020 at 6:17 AM.

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

Ryan Blue
+1, I agree with Tom.

In reply to Dongjoon Hyun's post of Tue, Nov 10, 2020 at 3:00 PM.


--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

Yuanjian Li
Already +1 in the PR. It would be great to mention the new config in the SS migration guide.

In reply to Ryan Blue's post of Wed, Nov 11, 2020 at 7:48 AM.

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

Liang-Chi Hsieh

Thanks all for the responses!

Based on these responses, I think we can go forward with the PR. I will put
the new config in the migration guide. Please help review the PR if you have
more comments.

Thank you!

