[SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)

Jacek Laskowski
Hi,

I'm trying to understand RewriteCorrelatedScalarSubquery optimization and how extractCorrelatedScalarSubqueries [1] works. I don't understand how "The expression is rewritten and returned." is done. How is the expression rewritten?

Since it's private it's not even possible to write tests and that got me thinking how you go about code like this? How do you know whether it works fine or not? Any help? I'd appreciate.

Reply | Threaded
Open this post in threaded view
|

Re: [SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)

Jacek Laskowski
Thanks Herman! You've helped me a lot (and am going to use your fine explanation in my gitbook quoting when needed! :))


What about tests? Don't you think the method should have tests? I'm writing small code snippets anyway while exploring it and have been wondering how to contribute it back to the Spark project given the method is private. It looks like I should instead be focusing on the methods of Expression or even QueryPlan to understand the various methods (as that's what triggered my question).

Thanks.


On Mon, May 28, 2018 at 11:33 AM, Herman van Hövell tot Westerflier <[hidden email]> wrote:
Hi Jacek,

RewriteCorrelatedScalarSubquery rewrites a plan containing a scalar subquery into a left join and a projection/filter/aggregate.

For example:
SELECT a_id,
       (SELECT MAX(value)
        FROM tbl_b
        WHERE tbl_b.b_id = tbl_a.a_id) AS max_value
FROM  tbl_a

Will be rewritten into something like this:
SELECT a_id,
       agg.max_value
FROM  tbl_a
      JOIN (SELECT   b_id,
                     MAX(value) AS max_value
            FROM     tbl_b
            GROUP BY b_id) agg
       ON agg.b_id = tbl_a.a_id
     
The particular function you refer to extracts all correlated scalar subqueries by adding them to the subqueries buffer and rewrites the expression to use output of the subquery (the s.plan.output.head bit). This returned expression then replaces the original expression in the operator (Aggregate, Project or Filter) it came from, completing the rewrite for that operator. The extracted subqueries are planned as LEFT OUTER JOINS below the operator.

I hope this makes sense.

Herman





On Sun, May 27, 2018 at 9:43 PM Jacek Laskowski <[hidden email]> wrote:
Hi,

I'm trying to understand RewriteCorrelatedScalarSubquery optimization and how extractCorrelatedScalarSubqueries [1] works. I don't understand how "The expression is rewritten and returned." is done. How is the expression rewritten?

Since it's private it's not even possible to write tests and that got me thinking how you go about code like this? How do you know whether it works fine or not? Any help? I'd appreciate.