[VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default


[VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

Gengliang Wang
Hi everyone,

I'd like to call for a new vote on SPARK-28885 "Follow ANSI store assignment rules in table insertion by default" after revising the ANSI store assignment policy (SPARK-29326).
When inserting a value into a column of a different data type, Spark performs type coercion. Currently, we support three policies for the store assignment rules: ANSI, legacy, and strict, which can be set via the option "spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the store assignment as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL's. It disallows certain unreasonable type conversions, such as converting `string` to `int` or `double` to `boolean`, and it throws a runtime exception if the value is out of range (overflow).
2. Legacy: Spark allows the store assignment as long as it is a valid `Cast`, which is very loose; e.g., converting `string` to `int` or `double` to `boolean` is allowed. This is the current behavior in Spark 2.x, kept for compatibility with Hive. When inserting an out-of-range value into an integral field, only the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1 (see the sketch after this list).
3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment; e.g., converting `double` to `int` or `decimal` to `double` is not allowed. The rules were originally designed for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by default.
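
To make the difference between the Legacy and ANSI behaviors concrete, here is a minimal sketch for a local Spark 3.x session. The table name `bytes_demo` and the session setup are hypothetical and only for illustration; the exact exception raised under ANSI depends on the Spark build, and the legacy behavior is assumed to apply to a V1 (file-based) table.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo; "bytes_demo" and the session settings are illustrative only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("store-assignment-policy-demo")
  .getOrCreate()

// A one-byte column; BYTE is Spark SQL's 1-byte signed integer type (a.k.a. TINYINT).
spark.sql("CREATE TABLE bytes_demo (b BYTE) USING parquet")

// Legacy: out-of-range values keep only the low-order bits, so 257 is stored as 1.
spark.sql("SET spark.sql.storeAssignmentPolicy=LEGACY")
spark.sql("INSERT INTO bytes_demo VALUES (257)")
spark.sql("SELECT b FROM bytes_demo").show()  // expected: 1

// ANSI: the same insert should fail with a runtime exception on overflow, and an
// insert of a string literal into an int column would be rejected as well.
spark.sql("SET spark.sql.storeAssignmentPolicy=ANSI")
spark.sql("INSERT INTO bytes_demo VALUES (257)")  // expected: runtime error, not 1

spark.stop()
```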

Currently, V1 data sources use the "Legacy" policy by default, while V2 uses "Strict". This proposal is to make "ANSI" the default policy for both V1 and V2 in Spark 3.0.
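
For users who need to keep the old behavior, the option remains available; below is a sketch, under the same illustrative assumptions as above, of pinning the policy explicitly when the session is created.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pin the policy at session start; the value names follow the option values
// listed above. LEGACY restores the Spark 2.x behavior for V1 writes, STRICT matches
// the previous V2 default, and ANSI is the proposed new default.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pin-store-assignment-policy")
  .config("spark.sql.storeAssignmentPolicy", "LEGACY")  // or "ANSI" / "STRICT"
  .getOrCreate()
```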

This vote is open until Friday (Oct. 11).

[ ] +1: Accept the proposal
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thank you!

Gengliang

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

Alessandro Solimando
+1 (non-binding)

I have been following this standardization effort and I think it is sound and it provides the needed flexibility via the option.

Best regards,
Alessandro


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

cloud0fan
+1

I think this is the most reasonable default behavior among the three.


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

RussS
+1 (non-binding). Sounds good to me


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

Takeshi Yamamuro
Thanks for the great work, Gengliang!

+1 for that.
As I said before, the behaviour is pretty common in DBMSs, so the change helps DBMS users.

Bests,
Takeshi


--
---
Takeshi Yamamuro

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

Hyukjin Kwon
+1 (binding)


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

Xiao Li-2
+1


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

Ryan Blue
+1

Thanks for fixing this!

--
Ryan Blue
Software Engineer
Netflix

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

Dongjoon Hyun-2
+1

Bests,
Dongjoon
