Inconsistencies with how catalyst optimizer handles non-deterministic expressions
I believe, that currently non-deterministic expressions are handled in two
conflicting approaches in the catalyst optimizer.
The first approach is the one I have seen in the recent pull request reviews
- the optimizer should never change the number of times a non-deterministic
expression is executed. A good example of this is /`Canonicalize.scala`/:
* In addition and multiplication we allow reordering non-deterministic
expressions, because both sides will be evaluated anyways.
* In boolean OR and AND we *do not* allow reordering non-deterministic
expressions, because the right side might not be evaluated.
Then there is another approach, where we allow reordering non-deterministic
expressions even in boolean OR and AND. A good example of this is the
/`PushPredicateThroughJoin`/ rule where we use the
/`condition.partition(_.deterministic)`/ pattern. Later the partitioned
expressions can be concatenated back, but this effectively changes the order
of execution and can make some non-deterministic expressions be not
evaluated on all the rows they would have been.
I'm sure that both of these approaches have good arguments for them. In my
* The first one allows users be more sure on how their stateful expressions
are evaluated - optimizer does not change the output.
* The second one allows catalyst to do better optimization.
But, by mixing both of them we get the worst of the both worlds - users
can't be sure about how the expressions are evaluated and we don't have the
"most optimal" queries.