Transform plan with scope

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Transform plan with scope

Marco Gaido
Hi all,

working on SPARK-24051 I realized that currently in the Optimizer and in all the places where we are transforming a query plan, we are lacking the context information of what is in scope and what is not.

Coming back to the ticket, the bug reported in the ticket is caused mainly by two reasons:
 1 - we have two aliases in different places of the plan;
 2 - (the focus of this email) we apply all the rules globally over the whole plan, without any notion of scope where something is reachable/visible or not.

I will start with an easy example to explain what I mean. If we have a simple query like:

select a, b from (
  select 1 as a, 2 as b from table1
    union
  select 3 as a, 4 as b from table2) q

We produce a tree which is logically something like:

Project0(a, b)
-   Union
--    Project1 (a, b)
---     ScanTable1
--    Project 2(a, b)
---     ScanTable2

So when we apply a transformation on Project1 for instance, we have no information about what is coming from ScanTable1 (or in general any node which is part of the subtree whose root is Project1): we miss a stateful transform which allows the children to tell the parent, grandparents, and so on what is in their scope. This is in particular true for the Attributes: in a node we have no idea if an Attribute comes from its subtree (it is in scope) or not.

So, the point of this email is: do you think in general might be useful to introduce a way of navigating the tree which allows the children to keep a state to be used by their parents? Or do you think it is useful in general to introduce the concept of scope (if an attribute can be accessed by a node of a plan)?

Thanks,
Marco


Reply | Threaded
Open this post in threaded view
|

Re: Transform plan with scope

Joseph Torres
Is there some transformation we'd want to apply to that tree, but can't because we have no concept of scope? It's already possible for a plan rule to traverse each node's subtree if it wants.

On Tue, Apr 24, 2018 at 10:18 AM, Marco Gaido <[hidden email]> wrote:
Hi all,

working on SPARK-24051 I realized that currently in the Optimizer and in all the places where we are transforming a query plan, we are lacking the context information of what is in scope and what is not.

Coming back to the ticket, the bug reported in the ticket is caused mainly by two reasons:
 1 - we have two aliases in different places of the plan;
 2 - (the focus of this email) we apply all the rules globally over the whole plan, without any notion of scope where something is reachable/visible or not.

I will start with an easy example to explain what I mean. If we have a simple query like:

select a, b from (
  select 1 as a, 2 as b from table1
    union
  select 3 as a, 4 as b from table2) q

We produce a tree which is logically something like:

Project0(a, b)
-   Union
--    Project1 (a, b)
---     ScanTable1
--    Project 2(a, b)
---     ScanTable2

So when we apply a transformation on Project1 for instance, we have no information about what is coming from ScanTable1 (or in general any node which is part of the subtree whose root is Project1): we miss a stateful transform which allows the children to tell the parent, grandparents, and so on what is in their scope. This is in particular true for the Attributes: in a node we have no idea if an Attribute comes from its subtree (it is in scope) or not.

So, the point of this email is: do you think in general might be useful to introduce a way of navigating the tree which allows the children to keep a state to be used by their parents? Or do you think it is useful in general to introduce the concept of scope (if an attribute can be accessed by a node of a plan)?

Thanks,
Marco



Reply | Threaded
Open this post in threaded view
|

Re: Transform plan with scope

Marco Gaido
Hi Joseph, Herman,

thanks for your answers. The specific rule I was looking at is FoldablePropagation. If you look at it, what is done is that first a AttributeMap containing all the possible foldable alias is collected, then they are replace in the whole plan (it is a bit more complex than this, I know, this is just a simplification). So in this scenario, if we have two aliases in completely separated trees we are nonetheless replacing them, since we have no idea of which is the scope where each of them is available in.

I know that this specific problem can be solved in a much easier way: we are facing a weird situation where there are two aliases with the same exprId and this is not a very good situation (we can just fix it in the analyzer, as you proposed Herman). But logically, it would make more sense that a replacement like this is enforced only where an attribute is in scope (in other places it should not occur) and I think that if we perform the things as they are logically needed we are less likely to introduce in weird bugs caused by situations which we never thought as possible (or which they are not, but they may become). So I was thinking that if such an operation could be useful also in other places, then probably we could introduce it: that is the reason of this email thread, understanding if we need it or if my idea is worthless (in this case, I apologize for wasting your time).

Yes, Herman, the management of the state is the hardest point. My best idea so far (but I am not satisfied with it) is to enforce that the state which is passed from the each child to the parent extends both Growable and Shrinkable and we pass two functions which for each node return respectively the items to discard form the state and to add to it. But in the case we think/decide that such an operation I proposed might be useful, probably we can spend more time on investigating the best solution (any suggestion in case would be very welcomed).

Any more thoughts on this?

Thanks for your answers and your time,
Marco

2018-04-24 19:47 GMT+02:00 Herman van Hövell tot Westerflier <[hidden email]>:
Hi Marco,

In the case of SPARK-24015 we should perhaps fix this in the analyzer. It seems that the plan is somewhat invalid.

If you do want to fix it in the optimizer we might be able to fix it using an approach for similar to RemoveRedundantAliases (which manually recurses down the tree).

As for your proposal we could explore this a little bit. My main question would be how would you pass information up the tree (from child to parent), and how would you merge such information if there are multiple children? It might be kind of painful to generalize.

- Herman

On Tue, Apr 24, 2018 at 7:37 PM, Joseph Torres <[hidden email]> wrote:
Is there some transformation we'd want to apply to that tree, but can't because we have no concept of scope? It's already possible for a plan rule to traverse each node's subtree if it wants.

On Tue, Apr 24, 2018 at 10:18 AM, Marco Gaido <[hidden email]> wrote:
Hi all,

working on SPARK-24051 I realized that currently in the Optimizer and in all the places where we are transforming a query plan, we are lacking the context information of what is in scope and what is not.

Coming back to the ticket, the bug reported in the ticket is caused mainly by two reasons:
 1 - we have two aliases in different places of the plan;
 2 - (the focus of this email) we apply all the rules globally over the whole plan, without any notion of scope where something is reachable/visible or not.

I will start with an easy example to explain what I mean. If we have a simple query like:

select a, b from (
  select 1 as a, 2 as b from table1
    union
  select 3 as a, 4 as b from table2) q

We produce a tree which is logically something like:

Project0(a, b)
-   Union
--    Project1 (a, b)
---     ScanTable1
--    Project 2(a, b)
---     ScanTable2

So when we apply a transformation on Project1 for instance, we have no information about what is coming from ScanTable1 (or in general any node which is part of the subtree whose root is Project1): we miss a stateful transform which allows the children to tell the parent, grandparents, and so on what is in their scope. This is in particular true for the Attributes: in a node we have no idea if an Attribute comes from its subtree (it is in scope) or not.

So, the point of this email is: do you think in general might be useful to introduce a way of navigating the tree which allows the children to keep a state to be used by their parents? Or do you think it is useful in general to introduce the concept of scope (if an attribute can be accessed by a node of a plan)?

Thanks,
Marco