[SparkSql] Casting of Predicate Literals

RussS
I've just run into this issue again with another user and I feel like most folks here have seen some flavor of this at some point. 

The user registers a Datasource with a column of type Date (or some other non-string type), then performs a query that looks like:

SELECT * from Source WHERE date_col > '2020-08-03'

Seeing that the predicate literal here is a String, Spark needs to make both sides of the comparison the same type, so it places a "Cast" on the Datasource column and our plan ends up looking like:

Cast(date_col as String) > '2020-08-03' 

Since the Datasource Strategies can't handle a push down of the "Cast" function, we lose the predicate pushdown we could have had. This can change a job from a single-partition lookup into a full scan, leading to a very confusing situation for the end user. I also wonder about the relative cost here: instead of casting the column value in every row, we could do a single cast on the predicate literal. In addition, doing the cast at the Analysis phase would cut the run short before any work even starts, rather than doing a perhaps meaningless comparison between a date and a non-date string.
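
A minimal sketch of what this looks like in practice, assuming a registered source named "source" (the table and column names here are hypothetical):

    // On a pre-3.0 Spark, explain() shows the cast landing on the column,
    // which keeps the comparison out of the filters pushed to the source:
    spark.sql("SELECT * FROM source WHERE date_col > '2020-08-03'").explain(true)
    // ... Filter (cast(date_col#0 as string) > 2020-08-03) ...
    // ... PushedFilters: [IsNotNull(date_col)] ...   <- no GreaterThan pushed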

I think we should seriously consider whether in cases like this we should attempt to cast the literal rather than casting the
source column. 

Please let me know if anyone has thoughts on this, or has some previous Jiras I could dig into if it's been discussed before,
Russ

Re: [SparkSql] Casting of Predicate Literals

cloud0fan
I think this is not a problem in 3.0 anymore; see https://issues.apache.org/jira/browse/SPARK-27638
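
A rough sketch of the 3.0 behavior that JIRA describes (same hypothetical names as above, and the exact plan output may differ): the string literal is cast to the column's type at analysis time, so the bare column reaches the pushdown machinery.

    spark.sql("SELECT * FROM source WHERE date_col > '2020-08-03'").explain(true)
    // ... Filter (date_col#0 > 18477) ...   <- literal folded to a DATE value
    // ... PushedFilters: [IsNotNull(date_col), GreaterThan(date_col,2020-08-03)] ...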


Re: [SparkSql] Casting of Predicate Literals

Xiao Li-2
Hi, Russell, 

You might hit other cases in which CAST blocks the predicate pushdown. If the Cast was added by the user and it changes the actual type, we are unable to optimize it automatically, because doing so could change the query's correctness. If it was added by our type coercion rules to make types consistent at query compile time, we can take a look at the specific rule. If you think any of these rules is unreasonable, or behaves differently from other database systems, we can discuss it in the PRs or JIRAs. In general, we have to be very cautious about changing these rules, since a change could have a big impact and alter query results silently.
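
A quick illustration of why a user-written cast can't simply be unwrapped: string and numeric comparison can disagree, so rewriting CAST(int_col AS STRING) > '5' into int_col > 5 would silently change results.

    spark.sql("SELECT CAST(10 AS STRING) > '5'").show()  // false: "10" < "5" lexicographically
    spark.sql("SELECT 10 > 5").show()                    // true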

Thanks,


Re: [SparkSql] Casting of Predicate Literals

RussS
In reply to this post by cloud0fan
Thanks! That's exactly what I was hoping for! Thanks for finding the Jira for me!


Re: [SparkSql] Casting of Predicate Literals

Bart Samwel-2
And how are we doing here on integer pushdowns? If someone does e.g. CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000" without the cast?
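
Concretely (hypothetical names), the question is whether

    spark.sql("SELECT * FROM source WHERE CAST(short_col AS LONG) < 1000")

can still reach the source as short_col < 1000. Since widening SHORT to LONG preserves order and 1000 round-trips through SHORT exactly, the two predicates select the same rows; the subtle cases are literals outside the SHORT range.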

--
Bart Samwel



Re: [SparkSql] Casting of Predicate Literals

cloud0fan
Currently we can't. This is something we should improve, by either pushing down the cast to the data source, or simplifying the predicates to eliminate the cast.




Re: [SparkSql] Casting of Predicate Literals

Chao Sun
> Currently we can't. This is something we should improve, by either pushing down the cast to the data source, or simplifying the predicates to eliminate the cast.

Hi all, I've created https://issues.apache.org/jira/browse/SPARK-32694 to track this. Feel free to comment on the JIRA.




Re: [SparkSql] Casting of Predicate Literals

Chao Sun
Hi,

I just realized there were already multiple attempts at this issue in the past. From the discussion it seems the preferred approach is to eliminate the casts before the predicates get pushed to data sources, at least for a few common cases such as numeric types. However, a few PRs following this direction were rejected (see [1] and [2]), so I'm wondering if this is still something worth trying, or if the community thinks this is too risky to touch.

On the other hand, perhaps we can do the minimum and generate some sort of warning to remind users that they need to add an explicit cast to enable pushdown in this case. What do you think?
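
The explicit-cast workaround, sketched with hypothetical names: casting the literal to the column's type keeps the column bare, so the comparison can be pushed down, while the coerced form puts the cast on the column and blocks it.

    spark.sql("SELECT * FROM source WHERE short_col < CAST(1000 AS SHORT)")  // column stays bare
    spark.sql("SELECT * FROM source WHERE short_col < 1000L")                // cast lands on short_col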

Thanks for your input!
Chao



Re: [SparkSql] Casting of Predicate Literals

Bart Samwel
IMO it's worth an attempt. The previous attempts seem to have been closed because of a general sense that this gets messy and leads to lots of special cases, but that's just how it is. This optimization could make the difference between sub-par and decent performance when using some of these datatypes. Also, even if the predicate doesn't get pushed down, the transformation can make executing the predicate faster, so this could be an early optimization rule, not tied to pushdowns specifically.

I agree that it gets tricky for some data types. So I'd suggest starting small and doing this only for integers. Then cover decimals. For those data types at least you can easily reason that the conversion is correct. Other data types are a lot trickier and we should analyze them one by one.
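
A toy sketch of the rule's shape for the integer case, over a hypothetical mini expression tree rather than Catalyst's (the real thing would be a Rule[LogicalPlan], and would also have to preserve NULL semantics for nullable columns): unwrap a widening SHORT-to-LONG cast when the literal round-trips through the narrower type, and fold to a constant when the range alone decides the comparison.

    sealed trait Expr
    case class ShortCol(name: String) extends Expr
    case class CastToLong(child: Expr) extends Expr
    case class LongLit(v: Long) extends Expr
    case class ShortLit(v: Short) extends Expr
    case class LessThan(left: Expr, right: Expr) extends Expr
    case object TrueLit extends Expr
    case object FalseLit extends Expr

    def unwrapCast(e: Expr): Expr = e match {
      case LessThan(CastToLong(col: ShortCol), LongLit(v)) =>
        if (v > Short.MaxValue) TrueLit            // every SHORT value is < v
        else if (v <= Short.MinValue) FalseLit     // no SHORT value is < v
        else LessThan(col, ShortLit(v.toShort))    // safe: widening preserves order
      case other => other
    }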

--
Bart Samwel



Re: [SparkSql] Casting of Predicate Literals

Chao Sun
Thanks Bart, I'll give it a try. Presto has done something very similar (thanks DB for finding this!). They published an article ([1]) last year with a very thorough analysis of all the cases, which I think can be used as a reference for the implementation in Spark.

