[DISCUSS] Resolve ambiguous parser rule between two "create table"s

classic Classic list List threaded Threaded
36 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Jungtaek Lim-2
Thanks, filed SPARK-31257. Thanks again for looking into this - I'll take a look whenever I get time sooner.

On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <[hidden email]> wrote:
Feel free to open another issue, I just used that one since it describes this and doesn't appear to be done.

On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <[hidden email]> wrote:
UPDATE: Sorry I just missed the PR (https://github.com/apache/spark/pull/28026). I still think it'd be nice to avoid recycling the JIRA issue which was resolved before. Shall we have a new JIRA issue with linking to SPARK-30098, and set proper priority?

On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <[hidden email]> wrote:
Would it be better to prioritize this to make sure the change is included in Spark 3.0? (Maybe filing an issue and set as a blocker)

Looks like there's consensus that SPARK-30098 brought ambiguous issue which should be fixed (though the consideration of severity seems to be different), and once we notice the issue it would be really odd if we publish it as it is, and try to fix it later (the fix may not be even included in 3.0.x as it might bring behavioral change).

On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <[hidden email]> wrote:
Hi Ryan,

It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with.

Thanks,
Wenchen

On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <[hidden email]> wrote:
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <[hidden email]> wrote:
2. PARTITIONED BY colTypeList: I think we can support it in the unified syntax. Just make sure it doesn't appear together with PARTITIONED BY transformList.

Another side note: Perhaps as part of (or after) unifying the CREATE TABLE syntax, we can also update Catalog.createTable() to support creating partitioned tables.


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

cloud0fan
Hi all,

I'd like to bring this up again to share the status and get more feedback. Currently, we all agree to unify the CREATE TABLE syntax by merging the native and Hive-style syntaxes.

The unified CREATE TABLE syntax will become the native syntax and there is no Hive-style syntax anymore. This brings several changes:
1. support PARTITION BY (col type, ...). This can't co-exist with PARTITION BY (col, ...), and simply adds partition columns to the end.
2. support SKEWED BY, which just fails
3. support STORE AS/BY, which can't co-exist with USING provider
4. support EXTERNAL as well

All the behaviors will remain the same as before, for the builtin catalog. However, the native CREATE TABLE syntax needs to support the v2 CreateTable command and we need to translate the new syntax changes to catalog plugin API calls, and we are still working on reaching an agreement about how to do it.

To unblock 3.0, I think there are two choices:
1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default, which effectively revert SPARK-30098. The CREATE TABLE syntax is still confusing but it's the same as 2.4
2. Do not support the v2 CreateTable command if STORE AS/BY or EXTERNAL is specified. This gives us more time to think about how to do it in 3.1.

If you have other ideas, please reply to this thread.

Thanks,
Wenchen

On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim <[hidden email]> wrote:
Thanks, filed SPARK-31257. Thanks again for looking into this - I'll take a look whenever I get time sooner.

On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <[hidden email]> wrote:
Feel free to open another issue, I just used that one since it describes this and doesn't appear to be done.

On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <[hidden email]> wrote:
UPDATE: Sorry I just missed the PR (https://github.com/apache/spark/pull/28026). I still think it'd be nice to avoid recycling the JIRA issue which was resolved before. Shall we have a new JIRA issue with linking to SPARK-30098, and set proper priority?

On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <[hidden email]> wrote:
Would it be better to prioritize this to make sure the change is included in Spark 3.0? (Maybe filing an issue and set as a blocker)

Looks like there's consensus that SPARK-30098 brought ambiguous issue which should be fixed (though the consideration of severity seems to be different), and once we notice the issue it would be really odd if we publish it as it is, and try to fix it later (the fix may not be even included in 3.0.x as it might bring behavioral change).

On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <[hidden email]> wrote:
Hi Ryan,

It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with.

Thanks,
Wenchen

On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <[hidden email]> wrote:
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <[hidden email]> wrote:
2. PARTITIONED BY colTypeList: I think we can support it in the unified syntax. Just make sure it doesn't appear together with PARTITIONED BY transformList.

Another side note: Perhaps as part of (or after) unifying the CREATE TABLE syntax, we can also update Catalog.createTable() to support creating partitioned tables.


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Jungtaek Lim-2
Let's focus on how to unblock Spark 3.0.0 for now, as other blockers are getting resolved.

I'm in favor of option 1 to avoid bring multiple backward incompatible changes. Unifying create table would bring backward incompatibility (I'd rather say the new syntax should be cleared up ignoring the backward compatibility) and we'd be better to not force end users to adopt the changes twice.

On Fri, May 8, 2020 at 11:22 PM Wenchen Fan <[hidden email]> wrote:
Hi all,

I'd like to bring this up again to share the status and get more feedback. Currently, we all agree to unify the CREATE TABLE syntax by merging the native and Hive-style syntaxes.

The unified CREATE TABLE syntax will become the native syntax and there is no Hive-style syntax anymore. This brings several changes:
1. support PARTITION BY (col type, ...). This can't co-exist with PARTITION BY (col, ...), and simply adds partition columns to the end.
2. support SKEWED BY, which just fails
3. support STORE AS/BY, which can't co-exist with USING provider
4. support EXTERNAL as well

All the behaviors will remain the same as before, for the builtin catalog. However, the native CREATE TABLE syntax needs to support the v2 CreateTable command and we need to translate the new syntax changes to catalog plugin API calls, and we are still working on reaching an agreement about how to do it.

To unblock 3.0, I think there are two choices:
1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default, which effectively revert SPARK-30098. The CREATE TABLE syntax is still confusing but it's the same as 2.4
2. Do not support the v2 CreateTable command if STORE AS/BY or EXTERNAL is specified. This gives us more time to think about how to do it in 3.1.

If you have other ideas, please reply to this thread.

Thanks,
Wenchen

On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim <[hidden email]> wrote:
Thanks, filed SPARK-31257. Thanks again for looking into this - I'll take a look whenever I get time sooner.

On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <[hidden email]> wrote:
Feel free to open another issue, I just used that one since it describes this and doesn't appear to be done.

On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <[hidden email]> wrote:
UPDATE: Sorry I just missed the PR (https://github.com/apache/spark/pull/28026). I still think it'd be nice to avoid recycling the JIRA issue which was resolved before. Shall we have a new JIRA issue with linking to SPARK-30098, and set proper priority?

On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <[hidden email]> wrote:
Would it be better to prioritize this to make sure the change is included in Spark 3.0? (Maybe filing an issue and set as a blocker)

Looks like there's consensus that SPARK-30098 brought ambiguous issue which should be fixed (though the consideration of severity seems to be different), and once we notice the issue it would be really odd if we publish it as it is, and try to fix it later (the fix may not be even included in 3.0.x as it might bring behavioral change).

On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <[hidden email]> wrote:
Hi Ryan,

It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with.

Thanks,
Wenchen

On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <[hidden email]> wrote:
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <[hidden email]> wrote:
2. PARTITIONED BY colTypeList: I think we can support it in the unified syntax. Just make sure it doesn't appear together with PARTITIONED BY transformList.

Another side note: Perhaps as part of (or after) unifying the CREATE TABLE syntax, we can also update Catalog.createTable() to support creating partitioned tables.


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Xiao Li-2
1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default, which effectively revert SPARK-30098. The CREATE TABLE syntax is still confusing but it's the same as 2.4
2. Do not support the v2 CreateTable command if STORE AS/BY or EXTERNAL is specified. This gives us more time to think about how to do it in 3.1.

I prefer to first turn on spark.sql.legacy.createHiveTableByDefault.enabled by default and then start RC2 first. 

We still can continue trying option 2, if we can finish it within 10 days. BTW, we still have multiple ongoing discussions about data source v2 APIs. To be honest, most Spark users will not hit these cases in Spark 3.0. Thus, temporarily blocking a few cases in DSV2 looks reasonable to me. We can support them in Spark 3.1. 

Xiao 





On Sun, May 10, 2020 at 9:32 PM Jungtaek Lim <[hidden email]> wrote:
Let's focus on how to unblock Spark 3.0.0 for now, as other blockers are getting resolved.

I'm in favor of option 1 to avoid bring multiple backward incompatible changes. Unifying create table would bring backward incompatibility (I'd rather say the new syntax should be cleared up ignoring the backward compatibility) and we'd be better to not force end users to adopt the changes twice.

On Fri, May 8, 2020 at 11:22 PM Wenchen Fan <[hidden email]> wrote:
Hi all,

I'd like to bring this up again to share the status and get more feedback. Currently, we all agree to unify the CREATE TABLE syntax by merging the native and Hive-style syntaxes.

The unified CREATE TABLE syntax will become the native syntax and there is no Hive-style syntax anymore. This brings several changes:
1. support PARTITION BY (col type, ...). This can't co-exist with PARTITION BY (col, ...), and simply adds partition columns to the end.
2. support SKEWED BY, which just fails
3. support STORE AS/BY, which can't co-exist with USING provider
4. support EXTERNAL as well

All the behaviors will remain the same as before, for the builtin catalog. However, the native CREATE TABLE syntax needs to support the v2 CreateTable command and we need to translate the new syntax changes to catalog plugin API calls, and we are still working on reaching an agreement about how to do it.

To unblock 3.0, I think there are two choices:
1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default, which effectively revert SPARK-30098. The CREATE TABLE syntax is still confusing but it's the same as 2.4
2. Do not support the v2 CreateTable command if STORE AS/BY or EXTERNAL is specified. This gives us more time to think about how to do it in 3.1.

If you have other ideas, please reply to this thread.

Thanks,
Wenchen

On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim <[hidden email]> wrote:
Thanks, filed SPARK-31257. Thanks again for looking into this - I'll take a look whenever I get time sooner.

On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <[hidden email]> wrote:
Feel free to open another issue, I just used that one since it describes this and doesn't appear to be done.

On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <[hidden email]> wrote:
UPDATE: Sorry I just missed the PR (https://github.com/apache/spark/pull/28026). I still think it'd be nice to avoid recycling the JIRA issue which was resolved before. Shall we have a new JIRA issue with linking to SPARK-30098, and set proper priority?

On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <[hidden email]> wrote:
Would it be better to prioritize this to make sure the change is included in Spark 3.0? (Maybe filing an issue and set as a blocker)

Looks like there's consensus that SPARK-30098 brought ambiguous issue which should be fixed (though the consideration of severity seems to be different), and once we notice the issue it would be really odd if we publish it as it is, and try to fix it later (the fix may not be even included in 3.0.x as it might bring behavioral change).

On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <[hidden email]> wrote:
Hi Ryan,

It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with.

Thanks,
Wenchen

On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <[hidden email]> wrote:
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <[hidden email]> wrote:
2. PARTITIONED BY colTypeList: I think we can support it in the unified syntax. Just make sure it doesn't appear together with PARTITIONED BY transformList.

Another side note: Perhaps as part of (or after) unifying the CREATE TABLE syntax, we can also update Catalog.createTable() to support creating partitioned tables.


--
Ryan Blue
Software Engineer
Netflix


--
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

JackyLee
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Ryan Blue
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

RussS
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Jungtaek Lim-2
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Jungtaek Lim-2
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Ryan Blue
I'm all for getting the unified syntax into master. The only issue appears to be whether or not to pass the presence of the EXTERNAL keyword through to a catalog in v2. Maybe it's time to start a discuss thread for that issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <[hidden email]> wrote:
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

cloud0fan
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code will be replaced in the master branch sooner or later, when we unify the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue <[hidden email]> wrote:
I'm all for getting the unified syntax into master. The only issue appears to be whether or not to pass the presence of the EXTERNAL keyword through to a catalog in v2. Maybe it's time to start a discuss thread for that issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <[hidden email]> wrote:
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Jungtaek Lim-2
It's not only for end users, but also for us. Spark itself uses the config "true" and "false" in tests and it still brings confusion. We still have to deal with both situations.

I'm wondering how long days it would be needed to revert it cleanly, but if we worry about the amount of code change just around the new RC, what about make the code dirty (should be fixed soon) but less headache via applying traditional (and bad) way?

Let's just remove the config so that the config cannot be used in any way (even in Spark codebase), and set corresponding field in parser to the constant value so that no one can modify in any way. This would make the dead code by intention which should be cleaned it up later, so let's add FIXME comment there so that anyone can take it up for cleaning up the code later. (If no one volunteers then I'll probably pick up.)

That is a bad pattern, but still better as we prevent end users (even early adopters) go through the undocumented path in any way, and that will be explicitly marked as "should be fixed". This is different from retaining config - I don't expect unified create table syntax will be landed in bugfix version, so even unified create table syntax can be landed in 3.1.0 (this is also not guaranteed) the config will live in 3.0.x in any way. If we temporarily go dirty way then we can clean up the code in any version, even from bugfix version, maybe within a couple of weeks just after 3.0.0 is released.

Does it sound valid?

On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <[hidden email]> wrote:
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code will be replaced in the master branch sooner or later, when we unify the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue <[hidden email]> wrote:
I'm all for getting the unified syntax into master. The only issue appears to be whether or not to pass the presence of the EXTERNAL keyword through to a catalog in v2. Maybe it's time to start a discuss thread for that issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <[hidden email]> wrote:
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Jungtaek Lim-2
Before I forget, we'd better not forget to change the doc, as create table doc looks to represent current syntax which will be incorrect later.

On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <[hidden email]> wrote:
It's not only for end users, but also for us. Spark itself uses the config "true" and "false" in tests and it still brings confusion. We still have to deal with both situations.

I'm wondering how long days it would be needed to revert it cleanly, but if we worry about the amount of code change just around the new RC, what about make the code dirty (should be fixed soon) but less headache via applying traditional (and bad) way?

Let's just remove the config so that the config cannot be used in any way (even in Spark codebase), and set corresponding field in parser to the constant value so that no one can modify in any way. This would make the dead code by intention which should be cleaned it up later, so let's add FIXME comment there so that anyone can take it up for cleaning up the code later. (If no one volunteers then I'll probably pick up.)

That is a bad pattern, but still better as we prevent end users (even early adopters) go through the undocumented path in any way, and that will be explicitly marked as "should be fixed". This is different from retaining config - I don't expect unified create table syntax will be landed in bugfix version, so even unified create table syntax can be landed in 3.1.0 (this is also not guaranteed) the config will live in 3.0.x in any way. If we temporarily go dirty way then we can clean up the code in any version, even from bugfix version, maybe within a couple of weeks just after 3.0.0 is released.

Does it sound valid?

On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <[hidden email]> wrote:
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code will be replaced in the master branch sooner or later, when we unify the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue <[hidden email]> wrote:
I'm all for getting the unified syntax into master. The only issue appears to be whether or not to pass the presence of the EXTERNAL keyword through to a catalog in v2. Maybe it's time to start a discuss thread for that issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <[hidden email]> wrote:
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Ryan Blue
+1 for the approach Jungtaek suggests. That will avoid needing to support behavior that is not well understood with minimal changes.

On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim <[hidden email]> wrote:
Before I forget, we'd better not forget to change the doc, as create table doc looks to represent current syntax which will be incorrect later.

On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <[hidden email]> wrote:
It's not only for end users, but also for us. Spark itself uses the config "true" and "false" in tests and it still brings confusion. We still have to deal with both situations.

I'm wondering how long days it would be needed to revert it cleanly, but if we worry about the amount of code change just around the new RC, what about make the code dirty (should be fixed soon) but less headache via applying traditional (and bad) way?

Let's just remove the config so that the config cannot be used in any way (even in Spark codebase), and set corresponding field in parser to the constant value so that no one can modify in any way. This would make the dead code by intention which should be cleaned it up later, so let's add FIXME comment there so that anyone can take it up for cleaning up the code later. (If no one volunteers then I'll probably pick up.)

That is a bad pattern, but still better as we prevent end users (even early adopters) go through the undocumented path in any way, and that will be explicitly marked as "should be fixed". This is different from retaining config - I don't expect unified create table syntax will be landed in bugfix version, so even unified create table syntax can be landed in 3.1.0 (this is also not guaranteed) the config will live in 3.0.x in any way. If we temporarily go dirty way then we can clean up the code in any version, even from bugfix version, maybe within a couple of weeks just after 3.0.0 is released.

Does it sound valid?

On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <[hidden email]> wrote:
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code will be replaced in the master branch sooner or later, when we unify the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue <[hidden email]> wrote:
I'm all for getting the unified syntax into master. The only issue appears to be whether or not to pass the presence of the EXTERNAL keyword through to a catalog in v2. Maybe it's time to start a discuss thread for that issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <[hidden email]> wrote:
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

RussS
I think that the dead code approach, while a bit unpalatable and worse than reverting, is probably better than leaving the parameter (even if it is hidden)

On Tue, May 12, 2020 at 12:46 PM Ryan Blue <[hidden email]> wrote:
+1 for the approach Jungtaek suggests. That will avoid needing to support behavior that is not well understood with minimal changes.

On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim <[hidden email]> wrote:
Before I forget, we'd better not forget to change the doc, as create table doc looks to represent current syntax which will be incorrect later.

On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <[hidden email]> wrote:
It's not only for end users, but also for us. Spark itself uses the config "true" and "false" in tests and it still brings confusion. We still have to deal with both situations.

I'm wondering how long days it would be needed to revert it cleanly, but if we worry about the amount of code change just around the new RC, what about make the code dirty (should be fixed soon) but less headache via applying traditional (and bad) way?

Let's just remove the config so that the config cannot be used in any way (even in Spark codebase), and set corresponding field in parser to the constant value so that no one can modify in any way. This would make the dead code by intention which should be cleaned it up later, so let's add FIXME comment there so that anyone can take it up for cleaning up the code later. (If no one volunteers then I'll probably pick up.)

That is a bad pattern, but still better as we prevent end users (even early adopters) go through the undocumented path in any way, and that will be explicitly marked as "should be fixed". This is different from retaining config - I don't expect unified create table syntax will be landed in bugfix version, so even unified create table syntax can be landed in 3.1.0 (this is also not guaranteed) the config will live in 3.0.x in any way. If we temporarily go dirty way then we can clean up the code in any version, even from bugfix version, maybe within a couple of weeks just after 3.0.0 is released.

Does it sound valid?

On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <[hidden email]> wrote:
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code will be replaced in the master branch sooner or later, when we unify the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue <[hidden email]> wrote:
I'm all for getting the unified syntax into master. The only issue appears to be whether or not to pass the presence of the EXTERNAL keyword through to a catalog in v2. Maybe it's time to start a discuss thread for that issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <[hidden email]> wrote:
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Jungtaek Lim-2
Never mind, forget about the dead code. Turned out that reverting "via manual" can be very easily done - remove config and apply to the tests -> remove field -> remove the changes into docs. It was considered as non-trivial because we only consider about "git revert" but there's no strict rule to do so.

Please take a look at this PR, https://github.com/apache/spark/pull/28517

On Wed, May 13, 2020 at 3:10 AM Russell Spitzer <[hidden email]> wrote:
I think that the dead code approach, while a bit unpalatable and worse than reverting, is probably better than leaving the parameter (even if it is hidden)

On Tue, May 12, 2020 at 12:46 PM Ryan Blue <[hidden email]> wrote:
+1 for the approach Jungtaek suggests. That will avoid needing to support behavior that is not well understood with minimal changes.

On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim <[hidden email]> wrote:
Before I forget, we'd better not forget to change the doc, as create table doc looks to represent current syntax which will be incorrect later.

On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <[hidden email]> wrote:
It's not only for end users, but also for us. Spark itself uses the config "true" and "false" in tests and it still brings confusion. We still have to deal with both situations.

I'm wondering how long days it would be needed to revert it cleanly, but if we worry about the amount of code change just around the new RC, what about make the code dirty (should be fixed soon) but less headache via applying traditional (and bad) way?

Let's just remove the config so that the config cannot be used in any way (even in Spark codebase), and set corresponding field in parser to the constant value so that no one can modify in any way. This would make the dead code by intention which should be cleaned it up later, so let's add FIXME comment there so that anyone can take it up for cleaning up the code later. (If no one volunteers then I'll probably pick up.)

That is a bad pattern, but still better as we prevent end users (even early adopters) go through the undocumented path in any way, and that will be explicitly marked as "should be fixed". This is different from retaining config - I don't expect unified create table syntax will be landed in bugfix version, so even unified create table syntax can be landed in 3.1.0 (this is also not guaranteed) the config will live in 3.0.x in any way. If we temporarily go dirty way then we can clean up the code in any version, even from bugfix version, maybe within a couple of weeks just after 3.0.0 is released.

Does it sound valid?

On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <[hidden email]> wrote:
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code will be replaced in the master branch sooner or later, when we unify the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue <[hidden email]> wrote:
I'm all for getting the unified syntax into master. The only issue appears to be whether or not to pass the presence of the EXTERNAL keyword through to a catalog in v2. Maybe it's time to start a discuss thread for that issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <[hidden email]> wrote:
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <[hidden email]> wrote:
I'm sorry, but I have to agree with Ryan and Russell. I chose the option 1 because it's less worse than option 2, but it doesn't mean I fully agree with option 1.

Let's make below things clear if we really go with option 1, otherwise please consider reverting it.

* Do you fully indicate about "all" the paths where the second create table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think end users will understand it easily?
* Do you have an actual end users to guide to turn this on? Or do you have a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document about how things will change if the flag is turned on?

I guess the option 1 is to leave a flag as "undocumented" one and forget about the path to turn on, but I think that would lead to make the feature be "broken window" even we are not able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <[hidden email]> wrote:
I think reverting 30098 is the right decision here if we want to unblock 3.0. We shouldn't ship with features which we know do not function in the way we intend, regardless of how little exposure most users have to them. Even if it's off my default, we should probably work to avoid switches that cause things to behave unpredictably or require a flow chart to actually determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <[hidden email]> wrote:
I'm all for fixing behavior in master by turning this off as an intermediate step, but I don't think that Spark 3.0 can safely include SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek pointed out. And that behavior is not fully understood. While working on a unified CREATE TABLE syntax, I hit additional test failures where the wrong create path was being used.

Unless we plan to NOT support the behavior when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <[hidden email]> wrote:
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and can not be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]



--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
12