Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap


Jacek Laskowski
Hi Spark Devs,

I really need your help understanding the relationship between HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap.

While exploring UnsafeFixedWidthAggregationMap and how it's used I've noticed that it's for HashAggregateExec and TungstenAggregationIterator exclusively. And given that TungstenAggregationIterator is used exclusively in HashAggregateExec and the use of UnsafeFixedWidthAggregationMap in both seems to be almost the same (if not the same), I've got a question I cannot seem to answer myself.

Since HashAggregateExec supports whole-stage codegen, HashAggregateExec.doExecute won't be used at all; doConsume and doProduce are used instead (unless codegen is disabled). Is that correct?

If so, TungstenAggregationIterator is not used at all, but UnsafeFixedWidthAggregationMap is used directly instead (in the Java code that uses createHashMap or finishAggregate). Is that correct?
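
To make the question above concrete, here is a minimal, self-contained sketch of the pattern being asked about: one hash map from grouping key to a fixed-width buffer, updated either through an iterator or directly from a plain loop of the kind generated code would contain. Everything in it is a toy assumption (ToyRow, ToyAggregationMap, bufferFor, the [count, sum] buffer layout); none of it is Spark's actual UnsafeFixedWidthAggregationMap or TungstenAggregationIterator code.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical input row: a grouping key and one value to aggregate.
final class ToyRow {
    final String key;
    final long value;

    ToyRow(String key, long value) {
        this.key = key;
        this.value = value;
    }
}

// Toy stand-in for a map from grouping key to a fixed-width aggregation buffer.
// Buffer layout (an assumption of this sketch): [count, sum].
final class ToyAggregationMap {
    private final Map<String, long[]> buffers = new LinkedHashMap<>();

    // Returns the buffer for the key, creating a zeroed one on first use.
    long[] bufferFor(String key) {
        return buffers.computeIfAbsent(key, k -> new long[2]);
    }

    Map<String, long[]> entries() {
        return buffers;
    }
}

final class ToyHashAggregateDemo {
    // Path 1: rows are pulled from an iterator and folded into the map.
    static ToyAggregationMap aggregateWithIterator(Iterator<ToyRow> rows) {
        ToyAggregationMap map = new ToyAggregationMap();
        while (rows.hasNext()) {
            ToyRow row = rows.next();
            long[] buffer = map.bufferFor(row.key);
            buffer[0] += 1;          // count
            buffer[1] += row.value;  // sum
        }
        return map;
    }

    // Path 2: a plain loop, in the style of generated code, updates the map directly.
    static ToyAggregationMap aggregateInline(ToyRow[] rows) {
        ToyAggregationMap map = new ToyAggregationMap();
        for (int i = 0; i < rows.length; i++) {
            long[] buffer = map.bufferFor(rows[i].key);
            buffer[0] += 1;
            buffer[1] += rows[i].value;
        }
        return map;
    }

    public static void main(String[] args) {
        ToyRow[] rows = { new ToyRow("a", 1), new ToyRow("b", 10), new ToyRow("a", 3) };
        ToyAggregationMap viaIterator = aggregateWithIterator(Arrays.asList(rows).iterator());
        ToyAggregationMap inline = aggregateInline(rows);
        for (Map.Entry<String, long[]> e : viaIterator.entries().entrySet()) {
            System.out.println(e.getKey()
                + " iterator=" + Arrays.toString(e.getValue())
                + " inline=" + Arrays.toString(inline.entries().get(e.getKey())));
        }
    }
}
```

Both paths use the map in the same way, which mirrors the "almost the same (if not the same)" observation; whether Spark's two code paths differ in details such as spilling is not something this sketch tries to show.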

Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap

Jacek Laskowski
Hi Devs,

Sorry for bothering you with my questions (and concerns), but I really need to understand this piece of code (= my personal challenge :))

Is it true that SparkPlan.doExecute (to "execute" a physical operator) is only used when whole-stage code gen is disabled (which it is not by default)? May I call this execution path traditional (even "old-fashioned")?

Is it true that these days SparkPlan.doProduce and SparkPlan.doConsume (and others) are used for "executing" a physical operator (i.e. to generate the Java source code), since whole-stage code generation is enabled by default and is currently the proper execution path?

p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen by WholeStageCodegenExec (and InputAdapter), but that's all the code that is to be executed by doExecute, isn't it?

On Fri, Sep 7, 2018 at 7:24 PM Jacek Laskowski <[hidden email]> wrote:
Hi Spark Devs,

I really need your help understanding the relationship between HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap.

While exploring UnsafeFixedWidthAggregationMap and how it's used I've noticed that it's for HashAggregateExec and TungstenAggregationIterator exclusively. And given that TungstenAggregationIterator is used exclusively in HashAggregateExec and the use of UnsafeFixedWidthAggregationMap in both seems to be almost the same (if not the same), I've got a question I cannot seem to answer myself.

Since HashAggregateExec supports whole-stage codegen, HashAggregateExec.doExecute won't be used at all; doConsume and doProduce are used instead (unless codegen is disabled). Is that correct?

If so, TungstenAggregationIterator is not used at all, but UnsafeFixedWidthAggregationMap is used directly instead (in the Java code that uses createHashMap or finishAggregate). Is that correct?

Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap

RussS
That's my understanding :) doExecute is for non-codegen, while doProduce and doConsume are for generating code.

On Fri, Sep 7, 2018 at 2:59 PM Jacek Laskowski <[hidden email]> wrote:
Hi Devs,

Sorry for bothering you with my questions (and concerns), but I really need to understand this piece of code (= my personal challenge :))

Is it true that SparkPlan.doExecute (to "execute" a physical operator) is only used when whole-stage code gen is disabled (which it is not by default)? May I call this execution path traditional (even "old-fashioned")?

Is it true that these days SparkPlan.doProduce and SparkPlan.doConsume (and others) are used for "executing" a physical operator (i.e. to generate the Java source code), since whole-stage code generation is enabled by default and is currently the proper execution path?

p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen by WholeStageCodegenExec (and InputAdapter), but that's all the code that is to be executed by doExecute, isn't it?


Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap

Jacek Laskowski
Thanks Russ! That helps a lot. 

On the other hand, it makes reviewing the codebase of Spark SQL slightly harder, since Java code generation is so much about string concatenation :(

p.s. Should all the code in doExecute be considered and marked @deprecated?
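
To illustrate the "string concatenation" remark above, here is a toy produce/consume sketch in which each operator contributes a fragment of Java source and the fragments are glued together into one loop. The class and method names (ToyOp, ToyRange, ToyFilterEven, ToyCollect) are hypothetical and the protocol is heavily simplified; this is not Spark's CodegenSupport API.

```java
// Toy operator: produce() emits the code that drives rows through the pipeline,
// consume(rowVar) emits the code that handles a single row held in rowVar.
abstract class ToyOp {
    ToyOp parent; // set by the operator that wraps this one

    abstract String produce();

    abstract String consume(String rowVar);
}

// Leaf: emits the loop and asks its parent what to do with each value.
final class ToyRange extends ToyOp {
    private final int n;

    ToyRange(int n) {
        this.n = n;
    }

    @Override
    String produce() {
        return "for (int i = 0; i < " + n + "; i++) {\n" + parent.consume("i") + "}\n";
    }

    @Override
    String consume(String rowVar) {
        throw new UnsupportedOperationException("a leaf has no child to consume from");
    }
}

// Middle operator: wraps the parent's per-row code in a condition.
final class ToyFilterEven extends ToyOp {
    private final ToyOp child;

    ToyFilterEven(ToyOp child) {
        this.child = child;
        child.parent = this;
    }

    @Override
    String produce() {
        return child.produce();
    }

    @Override
    String consume(String rowVar) {
        return "  if (" + rowVar + " % 2 == 0) {\n" + parent.consume(rowVar) + "  }\n";
    }
}

// Root: emits the code that stores a surviving row.
final class ToyCollect extends ToyOp {
    private final ToyOp child;

    ToyCollect(ToyOp child) {
        this.child = child;
        child.parent = this;
    }

    @Override
    String produce() {
        return child.produce();
    }

    @Override
    String consume(String rowVar) {
        return "    out.add(" + rowVar + ");\n";
    }
}

final class ToyCodegenDemo {
    // Builds Collect(FilterEven(Range(n))) and returns the generated source text.
    static String generate(int n) {
        return new ToyCollect(new ToyFilterEven(new ToyRange(n))).produce();
    }

    public static void main(String[] args) {
        System.out.print(generate(4));
    }
}
```

The call to produce() travels down to the leaf, and the consume() calls travel back up, so the text of a single fused loop is assembled from pieces that live in three different classes. That is the part that makes this style of code harder to review than a plain iterator implementation.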

On Fri, Sep 7, 2018 at 10:05 PM Russell Spitzer <[hidden email]> wrote:
That's my understanding :) doExecute is for non-codegen, while doProduce and doConsume are for generating code.


Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap

Jacek Laskowski
Hi Herman,

Right. No @deprecated, but something that would tell people who review the code "be extra careful since you're reading code that is no longer in use" for SparkPlans that do support WSCG. That would help a lot, as I got tricked a few times already while trying to understand something that I should not have been bothered much with.

Thanks Russ and Herman for your help to get my thinking right. That will also help my Spark clients, esp. during Spark SQL workshops!

On Sat, Sep 8, 2018 at 3:53 PM Herman van Hovell <[hidden email]> wrote:
...pressed send too early...

Moreover, we can't always use whole-stage code generation. In that case we fall back to Volcano-style execution and chain together doExecute() calls.

On Sat, Sep 8, 2018 at 3:51 PM Herman van Hovell <[hidden email]> wrote:
SparkPlan.doExecute() is the only way you can execute a physical SQL plan, so it should not be marked as deprecated. Whole-stage code generation collapses a subtree of SparkPlans (that support whole-stage code generation) into a single WholeStageCodegenExec physical plan. During execution we call doExecute() on the WholeStageCodegenExec node.
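
The two execution styles described above can be sketched with a toy plan tree. In the sketch, ToyScan and ToyFilter chain doExecute() calls and pull rows one at a time through iterators, while ToyFusedScanFilter stands in for a collapsed subtree: it is still entered through doExecute(), but runs a single loop. All names are hypothetical, and the fused loop is written by hand here, whereas Spark generates and compiles Java source for it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Toy physical operator: doExecute() is the only entry point for running it.
interface ToyPlan {
    Iterator<Integer> doExecute();
}

// Leaf operator: yields 0, 1, ..., n - 1.
final class ToyScan implements ToyPlan {
    private final int n;

    ToyScan(int n) {
        this.n = n;
    }

    @Override
    public Iterator<Integer> doExecute() {
        return new Iterator<Integer>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < n;
            }

            @Override
            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                return i++;
            }
        };
    }
}

// Iterator-style filter: pulls from the child's doExecute() and keeps even values.
final class ToyFilter implements ToyPlan {
    private final ToyPlan child;

    ToyFilter(ToyPlan child) {
        this.child = child;
    }

    @Override
    public Iterator<Integer> doExecute() {
        final Iterator<Integer> input = child.doExecute();
        return new Iterator<Integer>() {
            private Integer pending; // next matching value, if already found

            @Override
            public boolean hasNext() {
                while (pending == null && input.hasNext()) {
                    Integer v = input.next();
                    if (v % 2 == 0) pending = v;
                }
                return pending != null;
            }

            @Override
            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                Integer v = pending;
                pending = null;
                return v;
            }
        };
    }
}

// Stand-in for a collapsed Scan + Filter subtree: one node, one fused loop.
final class ToyFusedScanFilter implements ToyPlan {
    private final int n;

    ToyFusedScanFilter(int n) {
        this.n = n;
    }

    @Override
    public Iterator<Integer> doExecute() {
        // Simplification: results are materialized in a list instead of streamed.
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (i % 2 == 0) out.add(i);
        }
        return out.iterator();
    }
}

final class ToyExecutionDemo {
    static List<Integer> collect(ToyPlan plan) {
        List<Integer> rows = new ArrayList<>();
        Iterator<Integer> it = plan.doExecute();
        while (it.hasNext()) rows.add(it.next());
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(collect(new ToyFilter(new ToyScan(6))));
        System.out.println(collect(new ToyFusedScanFilter(6)));
    }
}
```

In both cases the caller only ever invokes doExecute() on the root of the plan, which matches the point that doExecute() remains the way a plan is executed even when a subtree has been collapsed.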

On Sat, Sep 8, 2018 at 11:55 AM Jacek Laskowski <[hidden email]> wrote:
Thanks Russ! That helps a lot. 

On the other hand, it makes reviewing the codebase of Spark SQL slightly harder, since Java code generation is so much about string concatenation :(

p.s. Should all the code in doExecute be considered and marked @deprecated?
