[DISCUSS] Out of order optimizer rules?

[DISCUSS] Out of order optimizer rules?

Ryan Blue

Hi everyone,

I have been working on a PR that moves filter and projection pushdown into the optimizer for DSv2, instead of doing it when converting to the physical plan. This will make DSv2 work with optimizer rules that depend on stats, like join reordering.
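To make the motivation concrete, here is a self-contained toy sketch (made-up table names and row counts, not Spark's actual cost model): pushing filters down shrinks a scan's estimated size, which can change which side of a join looks cheaper to a stats-based rule.

  object PushdownStatsDemo extends App {
    // Cost-based reordering keys off estimated scan sizes.
    case class ScanEstimate(table: String, rows: Long)

    def smallerSide(a: ScanEstimate, b: ScanEstimate): String =
      if (a.rows <= b.rows) a.table else b.table

    val ordersNoPushdown   = ScanEstimate("orders", 1000000000L) // filter not pushed yet
    val ordersWithPushdown = ScanEstimate("orders", 50000L)      // filter pushed into the scan
    val customers          = ScanEstimate("customers", 10000000L)

    // Without early pushdown the planner picks `customers` as the small side;
    // with pushdown it correctly picks the filtered `orders` scan.
    println(smallerSide(ordersNoPushdown, customers))   // customers
    println(smallerSide(ordersWithPushdown, customers)) // orders
  }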

While adding the optimizer rule, I found that some rules appear to be out of order. For example, PruneFileSourcePartitions, which handles filter pushdown for v1 scans, is in SparkOptimizer (spark-sql) in a batch that runs after all of the batches in Optimizer (spark-catalyst), including CostBasedJoinReorder.
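A self-contained toy model of that structure (the batch names mirror Spark's, but this is an illustration, not the actual source):

  object BatchOrderingDemo extends App {
    case class Batch(name: String)

    class Optimizer { // stands in for the abstract Optimizer in spark-catalyst
      def defaultBatches: Seq[Batch] = Seq(
        Batch("Operator Optimizations"), // pushdown, constant folding, ...
        Batch("Join Reorder")            // CostBasedJoinReorder lives here
      )
    }

    class SparkOptimizer extends Optimizer { // stands in for the one in spark-sql
      // Appending with :+ places these after every parent batch:
      override def defaultBatches: Seq[Batch] =
        super.defaultBatches :+
          Batch("Prune File Source Table Partitions") :+ // v1 filter pushdown
          Batch("PartitionPruning")                      // dynamic partition pruning
    }

    new SparkOptimizer().defaultBatches.map(_.name).foreach(println)
    // "Join Reorder" prints before either pruning batch: the ordering in question.
  }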

SparkOptimizer also adds the new “dynamic partition pruning” rules after both the cost-based join reordering and the v1 partition pruning rule. I’m not sure why these should run after join reordering and partition pruning; it seems to me that the additional filters would be good to have before those rules run.

It looks like this might simply be because the rules were written in the spark-sql module instead of in catalyst. That makes some sense for the v1 pushdown, which alters physical plan details (FileIndex) that have leaked into the logical plan. I’m not sure why the dynamic partition pruning rules aren’t in catalyst, or why they run after the v1 predicate pushdown.

Can someone more familiar with these rules clarify why they appear to be out of order?

Assuming that this is an accident, I think it’s something that should be fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning may still need to be addressed.

rb

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Out of order optimizer rules?

cloud0fan
The dynamic partition pruning rule generates "hidden" filters that are converted to real predicates at runtime, so it doesn't matter where we run the rule.
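To illustrate the idea with a toy model (hypothetical names, not Spark's actual classes): the rule plants a placeholder predicate whose values only become known once the other side of the join has run.

  object HiddenFilterDemo extends App {
    sealed trait Predicate
    case class In(column: String, values: Seq[String]) extends Predicate

    // The "hidden" filter: no concrete values at optimization time, just a
    // thunk standing in for the join side that will supply them at runtime.
    case class DynamicPruningPlaceholder(column: String, buildSide: () => Seq[String])
        extends Predicate

    def materializeAtRuntime(p: Predicate): Predicate = p match {
      case DynamicPruningPlaceholder(col, build) => In(col, build()) // now a real predicate
      case other                                 => other
    }

    val hidden = DynamicPruningPlaceholder("date", () => Seq("2019-10-01", "2019-10-02"))
    println(materializeAtRuntime(hidden)) // In(date, List(2019-10-01, 2019-10-02))
  }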

For PruneFileSourcePartitions, I'm not quite sure. It seems to me it would be better to run it before join reordering.

Re: [DISCUSS] Out of order optimizer rules?

Ryan Blue
Where can I find a design doc for dynamic partition pruning that explains how it works?

The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition pruning (as pointed out by Henry R.) and doesn't have any comments about the implementation's approach. The PR description also doesn't have much information: it lists three cases for how a filter is built, but says nothing about the overall approach or design that would help in working out where the rules should be placed in the optimizer. It also isn't clear why this couldn't be part of spark-catalyst.

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Out of order optimizer rules?

rxin
Whoever created the JIRA years ago didn't describe DPP correctly, but the linked JIRA in Hive (https://issues.apache.org/jira/browse/HIVE-9152) is correct, though unfortunately it is much terser than any of the patches we have in Spark. Henry R.'s description was also correct.

Re: [DISCUSS] Out of order optimizer rules?

Maryann Xue
> It lists three cases for how a filter is built, but says nothing about the overall approach or design that would help in working out where the rules should be placed in the optimizer.

The overall idea of DPP can be put simply: use the result of one side of a join to prune partitions of the scan on the other side. The optimal situation is a broadcast join where the table being partition-pruned is on the probe side; in that case, by the time the probe side starts, the filter results are already available and ready for reuse.
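As a concrete example of the query shape this targets (illustrative only; the table and column names are made up):

  import org.apache.spark.sql.SparkSession

  object DppExample extends App {
    val spark = SparkSession.builder().master("local[*]").appName("dpp-example").getOrCreate()
    // `sales` is a large fact table partitioned by `date`; `dim_date` is
    // small enough to broadcast.
    spark.sql("""
      SELECT s.item, s.amount
      FROM sales s                        -- probe side, partitioned by date
      JOIN dim_date d ON s.date = d.date  -- build side, broadcast
      WHERE d.quarter = 'Q4-2019'         -- static filter on the build side
    """).show()
    // The dates matching the WHERE clause become the dynamic filter on the
    // `sales` scan, so partitions outside Q4-2019 are never read.
  }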

Regarding its place among the optimizer rules, it should preferably run late in the optimization, and definitely after join reordering.


Thanks,
Maryann

Re: [DISCUSS] Out of order optimizer rules?

Ryan Blue
Thanks for the pointers, but what I'm looking for is information about the design of this implementation, such as what requires it to be in spark-sql instead of spark-catalyst.

Even a high-level description, like what the optimizer rules are and what they do, would be great. Was there one written up internally that you could share?

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Out of order optimizer rules?

Maryann Xue
The reason it's in spark-sql is simply that HadoopFsRelation, which the rule tries to match, is in spark-sql.
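Roughly, the shape of the match is the following (an abridged paraphrase of the rule, not the actual body): the pattern destructures HadoopFsRelation, a spark-sql class, so the rule cannot live in spark-catalyst without inverting the module dependency.

  import org.apache.spark.sql.catalyst.planning.PhysicalOperation
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule
  import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

  object PruneFileSourcePartitionsSketch extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
      case op @ PhysicalOperation(_, filters,
          LogicalRelation(fs: HadoopFsRelation, _, _, _)) if filters.nonEmpty =>
        // The real rule splits `filters` into partition-column predicates,
        // prunes fs.location (a FileIndex) with them, and rebuilds the relation
        // with updated stats. Elided here; the point is the HadoopFsRelation match.
        op
    }
  }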

We should probably update the high-level description in the JIRA. I'll work on that shortly.

Re: [DISCUSS] Out of order optimizer rules?

rxin
In reply to this post by Ryan Blue
No, there is no separate write-up internally.

Re: [DISCUSS] Out of order optimizer rules?

Maryann Xue
There is no internal write-up, but I think we should at least add an up-to-date description to that JIRA entry.

Re: [DISCUSS] Out of order optimizer rules?

rxin
I just looked at the PR. I think there is some follow-up work that needs to be done; e.g., we shouldn't create a top-level package org.apache.spark.sql.dynamicpruning.

