Inconsistencies with how catalyst optimizer handles non-deterministic expressions



tanelk
Hello,

I believe that non-deterministic expressions are currently handled in two
conflicting ways in the Catalyst optimizer.

The first approach is the one I have seen in recent pull request reviews:
the optimizer should never change the number of times a non-deterministic
expression is executed. A good example of this is `Canonicalize.scala`:
* In addition and multiplication we allow reordering non-deterministic
expressions, because both sides will be evaluated anyway.
* In boolean OR and AND we *do not* allow reordering non-deterministic
expressions, because the right side might not be evaluated.
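To make the OR/AND case concrete, here is a minimal plain-Scala sketch (not Catalyst code; the object and method names are made up for illustration) showing why short-circuiting makes operand order observable: a non-deterministic operand runs a different number of times depending on which side of `||` it sits on.

```scala
// Demonstrates that reordering operands of a short-circuiting OR changes
// how many times a side-effecting ("non-deterministic") operand is evaluated.
object ShortCircuitDemo {
  // Returns (evaluations when nonDet is the right operand,
  //          evaluations when nonDet is the left operand).
  def countEvals(): (Int, Int) = {
    var n = 0
    def nonDet(): Boolean = { n += 1; true } // stand-in for e.g. rand() < 0.5

    n = 0
    val r1 = true || nonDet()   // left side is true, so nonDet is skipped
    n = 0
    val asRight = { true || nonDet(); n }

    n = 0
    val asLeft = { nonDet() || true; n } // nonDet is evaluated first, always

    (asRight, asLeft)
  }

  def main(args: Array[String]): Unit = {
    val (right, left) = countEvals()
    println(s"as right operand: $right evaluation(s); as left operand: $left evaluation(s)")
  }
}
```

Swapping the operands changes the evaluation count from 0 to 1, which is exactly why `Canonicalize` refuses to reorder non-deterministic children of OR and AND.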

Then there is another approach, where we allow reordering non-deterministic
expressions even in boolean OR and AND. A good example of this is the
`PushPredicateThroughJoin` rule, where we use the
`condition.partition(_.deterministic)` pattern. Later the partitioned
expressions can be concatenated back, but this effectively changes the order
of execution and can cause some non-deterministic expressions not to be
evaluated on all the rows they otherwise would have been.
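A toy sketch of that pattern (hypothetical names; not the real `PushPredicateThroughJoin` code, which works on Catalyst `Expression` trees) makes the reordering visible: after `partition` and re-concatenation, a non-deterministic conjunct that was originally first ends up last.

```scala
// Sketch of the condition.partition(_.deterministic) pattern and the
// conjunct reordering it implies.
object PartitionDemo {
  // Minimal stand-in for a Catalyst predicate expression.
  final case class Pred(name: String, deterministic: Boolean)

  // Splits AND-ed conjuncts as the optimizer rule does, then concatenates
  // the groups back. Each group keeps its internal order, but the original
  // interleaving is lost: non-deterministic conjuncts move to the end.
  def reorder(conjuncts: Seq[Pred]): Seq[Pred] = {
    val (det, nonDet) = conjuncts.partition(_.deterministic)
    det ++ nonDet
  }

  def main(args: Array[String]): Unit = {
    val original = Seq(
      Pred("rand() < 0.5", deterministic = false),
      Pred("a > 1", deterministic = true),
      Pred("b = 2", deterministic = true))
    // With AND short-circuiting, "rand() < 0.5" is now evaluated only on
    // rows that pass the deterministic filters, not on every row.
    println(reorder(original).map(_.name).mkString(" AND "))
  }
}
```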

Initially I was sure that the second approach was wrong and was about to
open a pull request to fix it. But then I found that this is not an
accidental mistake; it is done on purpose:
https://github.com/apache/spark/pull/20069

I'm sure that both of these approaches have good arguments for them. In my
eyes:
* The first one lets users be more sure of how their stateful expressions
are evaluated: the optimizer does not change the output.
* The second one lets Catalyst do better optimization.
But by mixing both of them we get the worst of both worlds: users can't be
sure how the expressions are evaluated, and we don't get the "most optimal"
queries.

What is the community's stance on this issue?

Regards,
Tanel



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
