SPARK-22211: Removing an incorrect FOJ optimization
I'm digging into some Spark SQL tickets, and wanted to ask a procedural question about SPARK-22211 and optimizer changes in general.
To summarise the JIRA, Catalyst appears to be incorrectly pushing a limit down below a FULL OUTER JOIN (FOJ), which can produce incorrect results. I don't believe there is a simple, equivalent optimization that we could use instead. There *is* a possibility that, with co-ordination between the logical and physical planners, we could use a join implementation that makes the optimization correct (see (*) below for some details).
My question is more general: how does the community make decisions about optimizer changes that have non-uniform effects on plan quality? Is there a standard set of benchmark queries that people run to judge the impact on common workloads?
It's clear that there needs to be a bugfix here, and the simplest one is to disable limit pushdown below FOJs altogether. The change (*) to preserve the optimization is arguably too brittle, and may not always pay off if it forces the physical planner to choose a suboptimal join implementation just so that the limit can be pushed down.
I am minded to push a patch that just disables the FOJ limit-pushing rule, and consider more complex optimizations as a follow-up, but wanted to see if I'm missing some inputs first.
(*) Pushing the limit down is safe if the join operator is guaranteed to emit unmatched tuples from the limited side before any unmatched tuples from the unlimited side. This is impractical with a sort-merge join, but possible with a hash-join if the limit gets pushed to the streaming side. The optimizer would also be prevented from reordering the join (and flipping the inputs) unless it treated the limit correctly and kept it fixed to the streaming side. But in principle, we could force the physical planner to detect a FOJ with a pushed-down limit, and only select a compatible join operator.
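To make the ordering argument in (*) concrete, here is a toy Python sketch. It is not Spark code: `full_outer_join` and its `unmatched_right_first` flag are hypothetical names I made up for illustration, and it assumes join keys are unique within each input. It pushes a limit to the left input and compares the two emission orders against the rows of the unmodified join.

```python
def full_outer_join(left, right, unmatched_right_first=False):
    """Toy full outer join on equal keys (illustration only, not Spark code).

    Assumes keys are unique within each input. Rows are (left_key, right_key)
    pairs, with None marking the missing side. The flag changes only the
    order in which rows are emitted, not the set of rows.
    """
    left_keys, right_keys = set(left), set(right)
    # Rows that originate from the left input: matched or left-unmatched.
    from_left = [(l, l if l in right_keys else None) for l in left]
    # Right rows that found no partner in the left input.
    right_only = [(None, r) for r in right if r not in left_keys]
    if unmatched_right_first:
        return right_only + from_left
    return from_left + right_only


left, right, n = [1, 2, 3], [1, 2, 3], 1

# Reference plan: Limit(n, FullOuterJoin(left, right)).
valid_rows = set(full_outer_join(left, right))

# Rewritten plan: Limit(n, FullOuterJoin(Limit(n, left), right)).
safe = full_outer_join(left[:n], right)[:n]
unsafe = full_outer_join(left[:n], right, unmatched_right_first=True)[:n]

print(safe, all(row in valid_rows for row in safe))      # [(1, 1)] True
print(unsafe, all(row in valid_rows for row in unsafe))  # [(None, 2)] False
```

With the limit pushed to the left side, the right rows with keys 2 and 3 lose their partners and show up as unmatched, although they are matched in the original join. The rewritten plan is only correct if the outer limit can never reach those rows, which is the emission-order guarantee described in (*).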