[DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`


[DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Dongjoon Hyun-2
Hi, All.

I'm sending this email because the Apache Spark committers should have a consistent point of view for the upcoming PRs. A clear community policy is the way to lead community members transparently and clearly, for the long-term good of the project.

First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache Spark 3.0.0 is going to achieve many improvements in the SQL area.

In particular, we invested a lot of effort in improving ANSI SQL compliance and the related test coverage. One simple example is the following.

    [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax

With this support, we are able to move away from the esoteric Spark functions `TRIM/LTRIM/RTRIM`.
(Note that the above syntax is the ANSI standard, but we also have these additional non-standard functions.)

Here is the summary of the current status.

    +------------------------+-------------+---------------+
    | SQL Processing Engine  | TRIM Syntax | TRIM Function |
    +------------------------+-------------+---------------+
    | Spark 3.0.0-preview/2  |      O      |       O       |
    | PostgreSQL             |      O      |       O       |
    | Presto                 |      O      |       O       |
    | MySQL                  |      O(*)   |       X       |
    | Oracle                 |      O      |       X       |
    | MsSQL                  |      O      |       X       |
    | Teradata               |      O      |       X       |
    | Hive                   |      X      |       X       |
    | Spark 1.6 ~ 2.2        |      X      |       X       |
    | Spark 2.3 ~ 2.4        |      X      |       O(**)   |
    +------------------------+-------------+---------------+
    (*) MySQL doesn't allow multiple trim characters.
    (**) Spark 2.3 ~ 2.4 have these functions with a different parameter order.

Here is an illustrative example of the problem.

    postgres=# SELECT trim('yxTomxx', 'xyz');
    btrim
    -------
    Tom

    presto:default> SELECT trim('yxTomxx', 'xyz');
    _col0
    -------
    Tom

    spark-sql> SELECT trim('yxTomxx', 'xyz');
    z
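
The difference comes purely from the argument order. As a sketch (Python here, not Spark code), `str.strip(chars)` trims a set of characters the same way these engines do, so interpreting the same call under the two conventions reproduces both results:

```python
def trim_postgres_style(src_str: str, trim_str: str) -> str:
    # PostgreSQL/Presto/Spark 3.0.0-preview order: trim(srcStr, trimStr).
    return src_str.strip(trim_str)

def trim_spark_2x_style(trim_str: str, src_str: str) -> str:
    # Spark 2.3 ~ 2.4 order: trim(trimStr, srcStr).
    return src_str.strip(trim_str)

# The same call, SELECT trim('yxTomxx', 'xyz'), under each convention:
print(trim_postgres_style('yxTomxx', 'xyz'))  # -> Tom
print(trim_spark_2x_style('yxTomxx', 'xyz'))  # -> z
```

In the second case, 'yxTomxx' is treated as the set of characters to strip from the source string 'xyz', which is why only 'z' survives.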

Here is the history of our attempts to fix the above issue.

    [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
    1. https://github.com/apache/spark/pull/24902
       (Merged 2019-06-18 for v3.0.0, 3.0.0-preview and 3.0.0-preview2 released.)
    2. https://github.com/apache/spark/pull/24907
       (Merged 2019-06-20 for v2.3.4, but reverted)
    3. https://github.com/apache/spark/pull/24908
       (Merged 2019-06-21 for v2.4.4, but reverted)

(2) and (3) were reverted before the releases because we didn't want to fix this in maintenance releases. Please see the following references for that decision.

    https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
    https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)

Now, there are some requests to revert SPARK-28093 and to keep these esoteric functions for backward compatibility, for the following reasons.

    > Reordering function parameters to match another system,
    > for a method that is otherwise working correctly,
    > sounds exactly like a cosmetic change to me.

    > How can we silently change the parameter of an existing SQL function?
    > I don't think this is a correctness issue as the SQL standard
    > doesn't say that the function signature have to be trim(srcStr, trimStr).

These concerns and points of view make sense.

My concerns are the following.

    1. This kind of esoteric difference is what is negatively called `vendor lock-in`.
       - It's difficult for new users to understand.
       - It's hard to migrate between Apache Spark and the other engines.
    2. Although we did our best, Apache Spark SQL has not always been sufficient.
    3. We need to keep improving it in future releases.

In short, we can either keep the 3.0.0-preview behavior here, or revert SPARK-28093 in order to keep this vendor lock-in behavior for backward compatibility.

What I want is to build a consistent point of view in this category for the upcoming PR reviews.

If we have a clear policy, we can save future community efforts in many ways.

Bests,
Dongjoon.


Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Dongjoon Hyun-2
Please note that the context is two-parameter TRIM/LTRIM/RTRIM and the TRIM(trimStr FROM str) syntax.

This thread does not concern one-parameter TRIM/LTRIM/RTRIM.



Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

cloud0fan
It's unfortunate that we don't have a clear document about breaking changes (I'm working on one, BTW). I believe the general guidance is: avoid breaking changes unless we have to. For example, the previous result was so broken that we had to fix it, moving to Scala 2.12 forced us to break some APIs, etc.

For this particular case, do we have to switch the parameter order? The order is different from some other systems, and it was not decided explicitly, but I don't think these are strong reasons. People coming from an RDBMS should use the SQL standard TRIM syntax more often. People using prior Spark versions should have already figured out the parameter order of Spark TRIM (there was no documentation) and adjusted their queries. There is no standard that defines the parameter order of the TRIM function.

In the long term, we would also promote the SQL standard TRIM syntax. I don't see a benefit of "fixing" the parameter order that is worth a breaking change.

Thanks,
Wenchen





Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Maxim Gekk
Also, if we look at the possible combinations of trim parameters:
1. foldable srcStr + foldable trimStr
2. foldable srcStr + non-foldable trimStr
3. non-foldable srcStr + foldable trimStr
4. non-foldable srcStr + non-foldable trimStr

Case #2 seems rare, and #3 is probably the most common. Whenever we see case #2, we could output a warning about the order of the parameters.
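
A minimal sketch of that heuristic (with a hypothetical `Expr` stand-in for a Catalyst expression; `foldable` means the argument is a constant that can be evaluated at analysis time):

```python
from dataclasses import dataclass

@dataclass
class Expr:
    # Hypothetical stand-in for a Catalyst expression.
    sql: str
    foldable: bool

def suspicious_trim_order(src_str: Expr, trim_str: Expr) -> bool:
    # Case #2 above: a constant source string with a non-constant trim
    # string strongly suggests the caller used the old parameter order.
    return src_str.foldable and not trim_str.foldable

# Likely swapped: the literal is where the column should be.
print(suspicious_trim_order(Expr("'xyz'", True), Expr("name", False)))  # -> True
# Common case #3: trim a constant set of characters from a column.
print(suspicious_trim_order(Expr("name", False), Expr("'xyz'", True)))  # -> False
```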

Maxim Gekk

Software Engineer

Databricks, Inc.





Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Takeshi Yamamuro
The revert looks fine to me for keeping compatibility.
Also, I think the different orders between the systems easily lead to mistakes, so, as Wenchen suggested, it would be nice to discourage these two-parameter trim functions in future releases.
Actually, I think the SQL TRIM syntax is enough for trim use cases.

Bests,
Takeshi



Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Dongjoon Hyun-2
Thank you for the feedback, Wenchen, Maxim, and Takeshi.

    1. "we would also promote the SQL standard TRIM syntax"
    2. "we could output a warning with a notice about the order of parameters"
    3. "it looks nice to make these (two parameters) trim functions unrecommended in future releases"

Yes, if we revert SPARK-28093, we had better deprecate these two-parameter SQL function invocations, because they are really esoteric and even inconsistent inside Spark (the Scala API follows the PostgreSQL/Presto style, like the following):

    def trim(e: Column, trimString: String)

If we keep this situation in the 3.0.0 release (a major release), it means Apache Spark will carry it forever.
And it becomes worse and worse over time, because it leads more users to fall into this trap.

There are a few ways to deprecate. I believe (3) is the proper next step in case of reverting, because (2) is infeasible and (1) is considered insufficient: it is silent, and silence is exactly what SPARK-28093 tried to avoid. We need a non-silent (noisy) option in this case. Technically, (3) can be done in `Analyzer.ResolveFunctions`.

    1. Documentation only: remove the examples and add a migration guide.
    2. Compile-time warning by annotation: this is not an option for a SQL function inside a SQL string.
    3. Runtime warning with a directional guide
       (log.warn("... USE TRIM(trimStr FROM str) INSTEAD"))
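
Option (3) could look roughly like the following sketch (hypothetical helper names; in Spark itself this would be Scala code hooked into `Analyzer.ResolveFunctions`):

```python
import logging

logger = logging.getLogger("analyzer")
DEPRECATED_TWO_ARG_TRIMS = {"trim", "ltrim", "rtrim"}

def deprecation_warning(name: str, num_args: int):
    # Return a directional warning for two-parameter TRIM/LTRIM/RTRIM calls,
    # or None for calls that are fine (e.g. the one-parameter forms).
    if name.lower() in DEPRECATED_TWO_ARG_TRIMS and num_args == 2:
        return (f"Two-parameter {name.upper()} is deprecated; "
                "USE TRIM(trimStr FROM str) INSTEAD.")
    return None

# During function resolution, log the warning when it applies.
msg = deprecation_warning("trim", 2)
if msg:
    logger.warning(msg)
```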

What do you think about (3)?
Bests,
Dongjoon.

On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro <[hidden email]> wrote:
The revert looks fine to me for keeping the compatibility.
Also, I think the different orders between the systems easily lead to mistakes, so
, as Wenchen suggested, it looks nice to make these (two parameters) trim functions
unrecommended in future releases:
Actually, I think the SQL TRIM syntax is enough for trim use cases...

Bests,
Takeshi


On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk <[hidden email]> wrote:
Also if we look at possible combination of trim parameters:
1. foldable srcStr + foldable trimStr
2. foldable srcStr + non-foldable trimStr
3. non-foldable srcStr + foldable trimStr
4. non-foldable srcStr + non-foldable trimStr

The case # 2 seems a rare case, and # 3 is probably the most common case. Once we see the second case, we could output a warning with a notice about the order of parameters.

Maxim Gekk

Software Engineer

Databricks, Inc.



On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan <[hidden email]> wrote:
It's unfortunate that we don't have a clear document to talk about breaking changes (I'm working on it BTW). I believe the general guidance is: avoid breaking changes unless we have to. For example, the previous result was so broken that we have to fix it, moving to Scala 2.12 makes us have to break some APIs, etc.

For this particular case, do we have to switch the parameter order? It's different from some systems, the order was not decided explicitly, but I don't think they are strong reasons. People from RDBMS should use the SQL standard TRIM syntax more often. People using prior Spark versions should have figured out the parameter order of Spark TRIM (there was no document) and adjusted their queries. There is no such standard that defines the parameter order of the TRIM function.

In the long term, we would also promote the SQL standard TRIM syntax. I don't see much benefit of "fixing" the parameter order that worth to make a breaking change.

Thanks,
Wenchen



On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun <[hidden email]> wrote:
Please note that the context if TRIM/LTRIM/RTRIM with two-parameters and TRIM(trimStr FROM str) syntax.

This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.

On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun <[hidden email]> wrote:
Hi, All.

I'm sending this email because the Apache Spark committers had better have a consistent point of views for the upcoming PRs. And, the community policy is the way to lead the community members transparently and clearly for a long term good.

First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache Spark 3.0.0 is going to achieve many improvement in SQL areas.

Especially, we invested lots of efforts to improve SQL ANSI compliance and related test coverage for the future. One simple example is the following.

    [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax

With this support, we are able to move away from one of esoteric Spark function `TRIM/LTRIM/RTRIM`.
(Note that the above syntax is ANSI standard, but we have additional non-standard these functions, too.)

Here is the summary of the current status.

    +------------------------+-------------+---------------+
    | SQL Processing Engine  | TRIM Syntax | TRIM Function |
    +------------------------+-------------+---------------+
    | Spark 3.0.0-preview/2  |      O      |       O       |
    | PostgreSQL             |      O      |       O       |
    | Presto                 |      O      |       O       |
    | MySQL                  |      O(*)   |       X       |
    | Oracle                 |      O      |       X       |
    | MsSQL                  |      O      |       X       |
    | Teradata               |      O      |       X       |
    | Hive                   |      X      |       X       |
    | Spark 1.6 ~ 2.2        |      X      |       X       |
    | Spark 2.3 ~ 2.4        |      X      |       O(**)   |
    +------------------------+-------------+---------------+
    (*)  MySQL doesn't allow multiple trim characters.
    (**) Spark 2.3 ~ 2.4 have the function with a different parameter order.

Here is an illustrative example of the problem.

    postgres=# SELECT trim('yxTomxx', 'xyz');
    btrim
    -------
    Tom

    presto:default> SELECT trim('yxTomxx', 'xyz');
    _col0
    -------
    Tom

    spark-sql> SELECT trim('yxTomxx', 'xyz');
    z
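The divergence is just an argument swap. It can be illustrated outside Spark with plain Python string methods, since `str.strip` takes a set of characters to remove from both ends (a sketch for illustration only, not Spark code):

```python
src, trim_chars = 'yxTomxx', 'xyz'

# PostgreSQL/Presto order, trim(srcStr, trimStr): strip trim_chars from src.
print(src.strip(trim_chars))    # Tom

# Spark 2.3 ~ 2.4 order, trim(trimStr, srcStr): the arguments are swapped, so
# 'yxTomxx' acts as the set of trim characters and 'xyz' as the source string.
print(trim_chars.strip(src))    # z
```

Both calls are "correct" for their own signature; the surprise is only for users who carry a query from one system to the other.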

Here is the history of our attempts to fix the above issue.

    [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
    1. https://github.com/apache/spark/pull/24902
       (Merged 2019-06-18 for v3.0.0; released in 3.0.0-preview and 3.0.0-preview2.)
    2. https://github.com/apache/spark/pull/24907
       (Merged 2019-06-20 for v2.3.4, but reverted)
    3. https://github.com/apache/spark/pull/24908
       (Merged 2019-06-21 for v2.4.4, but reverted)

(2) and (3) were reverted before the releases because we didn't want to fix this in maintenance releases. Please see the following references for the decision.

    https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
    https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)

Now, there are some requests to revert SPARK-28093 and to keep these esoteric functions for backward compatibility, for the following reasons.

    > Reordering function parameters to match another system,
    > for a method that is otherwise working correctly,
    > sounds exactly like a cosmetic change to me.

    > How can we silently change the parameter of an existing SQL function?
    > I don't think this is a correctness issue as the SQL standard
    > doesn't say that the function signature have to be trim(srcStr, trimStr).

These concerns and points of view make sense.

My concerns are the following.

    1. This kind of esoteric difference is what people negatively call `vendor lock-in`.
       - It's difficult for new users to understand.
       - It's hard to migrate between Apache Spark and other systems.
    2. Although we did our best, Apache Spark SQL has not always been good enough.
    3. We need to keep improving in future releases.

In short, we can either keep the 3.0.0-preview behavior here, or revert SPARK-28093 in order to keep this vendor-lock-in behavior for backward compatibility.

What I want is to build a consistent point of view in this category for the upcoming PR reviews.

If we have a clear policy, we can save future community efforts in many ways.

Bests,
Dongjoon.




Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

cloud0fan
I don't know the best way to deprecate a SQL function. A runtime warning can be annoying if it keeps coming up. Maybe we should only log the warning once per Spark application.

On Tue, Feb 18, 2020 at 3:45 PM Dongjoon Hyun <[hidden email]> wrote:
Thank you for the feedback, Wenchen, Maxim, and Takeshi.

    1. "we would also promote the SQL standard TRIM syntax"
    2. "we could output a warning with a notice about the order of parameters"
    3. "it looks nice to make these (two parameters) trim functions unrecommended in future releases"

Yes, if we revert SPARK-28093, we had better deprecate these two-parameter SQL function invocations, because this behavior is really esoteric and even inconsistent inside Spark (the Scala API also follows the PostgreSQL/Presto style, like the following):

    def trim(e: Column, trimString: String)

If we keep this situation in the 3.0.0 release (a major release), it means Apache Spark will carry this behavior forever.
And it becomes worse and worse, because it leads more users to fall into this trap.

There are a few ways to deprecate. I believe (3) is the proper next step in case of a revert, because (2) is infeasible and (1) is considered insufficient because it is silent; we need a non-silent (noisy) approach in this case. Technically, (3) can be done in `Analyzer.ResolveFunctions`.

    1. Documentation-only: remove the examples and add a migration guide.
    2. Compile-time warning by annotation: this is not an option for a SQL function invoked from a SQL string.
    3. Runtime warning with a directional guide, e.g.
       log.warn("... USE TRIM(trimStr FROM str) INSTEAD")

What do you think about (3)?
 
Bests,
Dongjoon.

On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro <[hidden email]> wrote:
The revert looks fine to me for keeping compatibility.
Also, I think the different orders between the systems easily lead to mistakes, so, as Wenchen suggested, it looks nice to make these two-parameter trim functions unrecommended in future releases.
Actually, I think the SQL TRIM syntax is enough for trim use cases...

Bests,
Takeshi


On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk <[hidden email]> wrote:
Also, if we look at the possible combinations of trim parameters:
1. foldable srcStr + foldable trimStr
2. foldable srcStr + non-foldable trimStr
3. non-foldable srcStr + foldable trimStr
4. non-foldable srcStr + non-foldable trimStr

Case #2 seems rare, and #3 is probably the most common. Whenever we see case #2, we could output a warning with a notice about the order of the parameters.
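That heuristic could be sketched as follows (a toy foldability model for illustration, not Spark's actual Expression API; the names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Expr:
    foldable: bool  # literals fold to a constant; column references do not

def looks_swapped(src_str: Expr, trim_str: Expr) -> bool:
    # Case #2 above: a constant source string with a non-constant trim string
    # is a strong hint that the caller wrote the arguments in the old order.
    return src_str.foldable and not trim_str.foldable
```

Under this model only case #2 would trigger the warning; cases #1 and #4 are ambiguous, and case #3 is the expected usage.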

Maxim Gekk

Software Engineer

Databricks, Inc.



On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan <[hidden email]> wrote:
It's unfortunate that we don't have a clear document about breaking changes (I'm working on one, BTW). I believe the general guidance is: avoid breaking changes unless we have to. For example, the previous result was so broken that we had to fix it; moving to Scala 2.12 forced us to break some APIs; etc.


Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Dongjoon Hyun-2
Yes, right. We need to choose the best approach case by case.
The following level of runtime warning also sounds reasonable to me:

> Maybe we should only log the warning once per Spark application.

Bests,
Dongjoon.


Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Michael Armbrust
This plan for evolving the TRIM function to be more standards-compliant sounds much better to me than the original change that just switched the order. It pushes users in the right direction and cleans up our tech debt without silently breaking existing workloads, and it means that programs won't return different results when run on Spark 2.x and Spark 3.x.

One caveat:

    > If we keep this situation in 3.0.0 release (a major release), it means Apache Spark will be forever

I think this ship has already sailed. There is nothing special about 3.0 here. If the API is in a released version of Spark, then the mistake has already been made.

Major releases are an opportunity to break APIs when we *have to*. We always strive to avoid breaking APIs even if they have not been in an X.0 release.