Preventing predicate pushdown

Preventing predicate pushdown

Tomasz Gawęda
Hi,

while working with the JDBC data source I saw that many "or" clauses with
non-equality operators cause huge performance degradation of the SQL query
sent to the database (DB2). For example:

val df = spark.read.format("jdbc").(other options to parallelize
load).load()

df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x
 > 100)").show() // in the real application these predicates span many
lines, with many ANDs and ORs

If I use cache() before the where, this "where" clause is not pushed
down. However, in a production system caching many sources is a waste of
memory (especially if the pipeline is long and I must cache many times).
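For reference, a minimal sketch of the cache()-based approach described above. The JDBC URL, table name, credentials, and literal values are placeholders, not the real application's settings:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-demo").getOrCreate()

// Hypothetical JDBC source; url/dbtable are illustrative only.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/MYDB")
  .option("dbtable", "MYSCHEMA.EVENTS")
  .load()

// cache() materializes the scan, so the optimizer does not push the
// filter past the cached plan into the JDBC source; the predicate is
// evaluated by Spark over the cached data instead of inside DB2.
val cached = df.cache()
cached
  .where("(date1 > '2018-01-01' and (date1 < '2018-05-01' or date1 is null)) or x > 100")
  .show()
```

The cost, as noted above, is that every cached source occupies executor memory for the lifetime of the cache.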


I asked on StackOverflow for better ideas:
https://stackoverflow.com/questions/50336355/how-to-prevent-predicate-pushdown

However, the answers there are only workarounds. I can use them, but
maybe it would be better to add such functionality directly to the API?
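For completeness, a sketch of one such workaround (assuming the `df` and `$param` values from the example above): breaking the plan lineage by round-tripping through the RDD API, so there is no JDBC relation left for the filter to be pushed into.

```scala
// Rebuild the DataFrame from its underlying RDD. The new logical plan
// is rooted at an RDD scan, so the optimizer has no JDBC relation to
// push the filter into; the predicate then runs in Spark rather than
// being translated into the generated DB2 query.
val noPushdown = spark.createDataFrame(df.rdd, df.schema)

noPushdown
  .where(s"(date1 > $param1 and (date1 < $param2 or date1 is null)) or x > 100")
  .show()
```

Unlike cache(), this does not pin data in memory, but it does force a full scan of the source and loses any other pushed-down pruning as well.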

For example: df.withAnalysisBarrier().where(...) ?

Please let me know whether I should create a JIRA, or whether this is
not a good idea for some reason.


Pozdrawiam / Best regards,

Tomek Gawęda


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Preventing predicate pushdown

cloud0fan
Applying predicate pushdown is an optimization, and it makes sense to provide configs to turn off certain optimizations. Feel free to create a JIRA.

Thanks,
Wenchen

On Tue, May 15, 2018 at 8:33 PM, Tomasz Gawęda <[hidden email]> wrote:
[quoted original message trimmed]



Re: Preventing predicate pushdown

Tomasz Gawęda

Thanks, filed https://issues.apache.org/jira/browse/SPARK-24288

Pozdrawiam / Best regards,

Tomek


On 2018-05-15 18:29, Wenchen Fan wrote:
[quoted reply trimmed]