[DISCUSS] PostgreSQL dialect


[DISCUSS] PostgreSQL dialect

cloud0fan
Hi all,

Recently we started an effort to achieve feature parity between Spark and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764

This has gone very well. We've added many missing features (parser rules, built-in functions, etc.) to Spark, and also corrected several inappropriate Spark behaviors to follow the SQL standard and PostgreSQL. Many thanks to all the people who have contributed to it!

There are several cases when adding a PostgreSQL feature:
1. Spark doesn't have this feature: just add it.
2. Spark has this feature, but the behavior is different:
    2.1 Spark's behavior doesn't make sense: change it to follow the SQL standard and PostgreSQL, with a legacy config to restore the old behavior.
    2.2 Spark's behavior makes sense but violates the SQL standard: change the behavior to follow the SQL standard and PostgreSQL when ANSI mode is enabled (default false; see the sketch after this list).
    2.3 Spark's behavior makes sense and doesn't violate the SQL standard: add the PostgreSQL behavior under the PostgreSQL dialect (the default is the Spark native dialect).
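
For reference, here is a minimal sketch of what flipping ANSI mode looks like from user code. The config key `spark.sql.ansi.enabled` and the failing cast shown here reflect later Spark releases and are assumptions on my side, not something this proposal defines; as noted further down, our cast is not fully ANSI-compliant yet.

```scala
import org.apache.spark.sql.SparkSession

object AnsiModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ansi-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    // Default (non-ANSI) behavior: an invalid cast silently returns null.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT) AS v").show()

    // ANSI behavior: the same cast fails at runtime instead of returning null.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    try {
      spark.sql("SELECT CAST('abc' AS INT) AS v").show()
    } catch {
      case e: Exception => println(s"ANSI mode rejected the cast: ${e.getMessage}")
    }

    spark.stop()
  }
}
```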

The PostgreSQL dialect itself is a good idea. It can help users migrate PostgreSQL workloads to Spark. Other databases have this strategy too; for example, DB2 provides an Oracle dialect.

However, there are so many differences between Spark and PostgreSQL, including SQL parsing, type coercion, function/operator behavior, data types, etc. I'm afraid that we may spend a lot of effort on it, make the Spark codebase pretty complicated, and still not be able to provide a usable PostgreSQL dialect.

Furthermore, it's not clear to me how many users actually need to migrate PostgreSQL workloads. I think it's much more important to make Spark ANSI-compliant first, which doesn't require that much work.

Recently I've seen multiple PRs adding PostgreSQL cast functions, while our own cast function is not ANSI-compliant yet. This makes me think that we should do something to properly prioritize ANSI mode over other dialects.

Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have three features under the PostgreSQL dialect (sketched below):
1. When casting a string to boolean, `t`, `tr`, `tru`, `yes`, etc. are also accepted as true strings.
2. `date - date` returns an interval in Spark (the SQL standard behavior), but returns an int in PostgreSQL.
3. `int / int` returns a double in Spark, but returns an int in PostgreSQL (there is no standard here).
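
To make the three differences concrete, here is a minimal sketch against Spark's native dialect, with the PostgreSQL results noted in comments. Exact result types and literal syntax can vary by Spark version, so treat it as illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object DialectDifferencesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // 1. String-to-boolean cast: PostgreSQL also accepts prefixes such as 'tr' and 'tru';
    //    Spark's native dialect does not, and (without ANSI mode) yields null instead.
    spark.sql("SELECT CAST('tru' AS BOOLEAN) AS b").show()  // null in Spark, true in PostgreSQL

    // 2. date - date: an interval in Spark (SQL standard behavior), an integer day count in PostgreSQL.
    spark.sql("SELECT DATE'2019-12-01' - DATE'2019-11-26' AS diff").show()

    // 3. int / int: a double in Spark, a truncating integer division in PostgreSQL.
    spark.sql("SELECT 7 / 2 AS q").show()                   // 3.5 in Spark, 3 in PostgreSQL

    spark.stop()
  }
}
```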

We should still add PostgreSQL features that Spark doesn't have, or where Spark's behavior violates the SQL standard. But for the others, let's just update the answer files of the PostgreSQL tests.

Any comments are welcome!

Thanks,
Wenchen

Re: [DISCUSS] PostgreSQL dialect

Sean Owen-2
Without knowing much about it, I have had the same question: how much of this is important enough to justify the effort? One particular negative effect has been that the new PostgreSQL tests add well over an hour to the test runs, IIRC. So I tend to agree about drawing a reasonable line on compatibility and maybe focusing elsewhere.


Re: [DISCUSS] PostgreSQL dialect

zero323
In reply to this post by cloud0fan

I think it is important to distinguish between two different concepts:

  • Adherence to standards and their well established implementations.
  • Enabling migrations from some product X to Spark.

While these two problems are related, they are independent, and one can be achieved without the other.

  • The former approach doesn't imply that all features of the SQL standard (or of a specific implementation) are provided. It is sufficient that the commonly used features that are implemented are standard compliant. Therefore, if an end user applies some well known pattern, things will work as expected.

    In my personal opinion that's something that is worth the required development resources, and in general it should happen within the project.
  • The latter one is more complicated. First of all, the premise that one can "migrate PostgreSQL workloads to Spark" seems to be flawed. While both Spark and PostgreSQL evolve, and probably have more in common today than a few years ago, they're not close enough to pretend that one can be a replacement for the other. In contrast, the existing compatibility layers between major vendors make sense because the feature disparity (at least when it comes to core functionality) is usually minimal. And that doesn't even touch the problem that PostgreSQL provides extensively used extension points that enable a broad and evolving ecosystem (what should we do about continuous queries? Should Structured Streaming provide some compatibility layer as well?).

    More realistically, Spark could provide a compatibility layer for some analytical tools that themselves provide some PostgreSQL compatibility, but these are not always fully compatible with upstream PostgreSQL, nor do they necessarily follow the latest PostgreSQL development.

    Furthermore, a compatibility layer can be, within certain limits (i.e. availability of the required primitives), maintained as a separate project, without putting more strain on existing resources. Effectively, what we care about here is whether we can translate a given SQL string into a logical or physical plan (see the sketch below).
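
As a rough illustration of that last point, the sketch below uses Spark's existing parser hook to turn a SQL string into an unresolved logical plan. `sessionState` is an unstable, internal-facing API, so treat this purely as a sketch of the idea; an external dialect layer would more likely plug in through `SparkSessionExtensions.injectParser`.

```scala
import org.apache.spark.sql.SparkSession

object ParsePlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // parsePlan turns SQL text into an unresolved LogicalPlan without executing it;
    // this is the kind of "SQL string -> plan" translation a compatibility layer needs.
    val plan = spark.sessionState.sqlParser.parsePlan("SELECT id + 1 AS next_id FROM range(3)")
    println(plan.treeString)

    spark.stop()
  }
}
```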


-- 
Best regards,
Maciej

Re: [DISCUSS] PostgreSQL dialect

Xiao Li-2
+1
 
> One particular negative effect has been that the new PostgreSQL tests add well over an hour to the test runs

Adding the PostgreSQL tests is meant to improve the test coverage of Spark SQL. We should continue to do this by importing more test cases. The quality of Spark highly depends on test coverage. We can further parallelize the test execution to reduce the test time.

> Migrating PostgreSQL workloads to Spark SQL

This should not be our current focus. In the near future, it is impossible to be fully compatible with PostgreSQL. We should focus on adding features that are useful to the Spark community. PostgreSQL is a good reference, but we do not need to follow it blindly. We have already closed multiple related JIRAs that tried to add PostgreSQL features that are not commonly used.

Cheers,

Xiao



Re: [DISCUSS] PostgreSQL dialect

Gengliang Wang
+1 to the practical proposal.
To me, the major concern is that the code base becomes complicated while the PostgreSQL dialect has very limited features. I tried introducing one big flag `spark.sql.dialect` and isolating the related code in #25697, but it seems hard to keep it clean (a hypothetical sketch of the problem follows below).
Furthermore, the PostgreSQL dialect configuration overlaps with ANSI mode, which can be confusing at times.
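
To show the kind of branching I mean, here is a purely hypothetical sketch; the `Dialect` trait and everything else below are invented for illustration and are not Spark code. A global dialect flag tends to push conditionals like this into parsing, type coercion, and expression evaluation, which is exactly what is hard to keep isolated.

```scala
object DialectFlagSketch {
  sealed trait Dialect
  case object SparkNative extends Dialect
  case object PostgreSQL  extends Dialect

  // Example: integer division would need to branch on the active dialect.
  def intDivide(a: Int, b: Int, dialect: Dialect): Any = dialect match {
    case SparkNative => a.toDouble / b // Spark: int / int yields a double
    case PostgreSQL  => a / b          // PostgreSQL: truncating integer division
  }

  def main(args: Array[String]): Unit = {
    println(intDivide(7, 2, SparkNative)) // 3.5
    println(intDivide(7, 2, PostgreSQL))  // 3
  }
}
```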

Gengliang


Re: [DISCUSS] PostgreSQL dialect

Takeshi Yamamuro
Yea, +1, that looks pretty reasonable to me.
> Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have three features under the PostgreSQL dialect:
I personally think we could at least stop work on the dialect until 3.0 is released.


--
---
Takeshi Yamamuro

Re: [DISCUSS] PostgreSQL dialect

Dongjoon Hyun-2
+1

Bests,
Dongjoon.


Re: [DISCUSS] PostgreSQL dialect

Driesprong, Fokko
+1 (non-binding)

Cheers, Fokko


Re: [DISCUSS] PostgreSQL dialect

Yuanjian Li
Thanks, all of you, for joining the discussion.
The PR is at https://github.com/apache/spark/pull/26763; all the PostgreSQL dialect related PRs are linked in its description.
I hope their authors can help with the review.

Best,
Yuanjian
