Why not implement CodegenSupport in class ShuffledHashJoinExec?

Wang, Gang

In some cases, shuffle hash join performs even better than sort merge join.

However, I noticed that ShuffledHashJoinExec does not implement CodegenSupport. Is there a particular concern behind that? And is there any chance of improving the performance of ShuffledHashJoinExec?

 

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

cloud0fan
By default, sort merge join is preferred over shuffle hash join; that's why we haven't spent resources on implementing codegen for it.
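
For readers wondering what "preferred" means in practice, here is a rough Python sketch of the planner check as I understand it from the Spark 2.x JoinSelection strategy. The function and parameter names are made up for illustration, the factor of 3 is an assumption based on that version, and the exact rule may differ between releases. The defaults mirror the configs `spark.sql.join.preferSortMergeJoin`, `spark.sql.autoBroadcastJoinThreshold` and `spark.sql.shuffle.partitions`.

```python
def prefers_shuffled_hash_join(
    build_side_bytes: int,
    stream_side_bytes: int,
    prefer_sort_merge_join: bool = True,               # spark.sql.join.preferSortMergeJoin
    auto_broadcast_threshold: int = 10 * 1024 * 1024,  # spark.sql.autoBroadcastJoinThreshold
    num_shuffle_partitions: int = 200,                 # spark.sql.shuffle.partitions
) -> bool:
    """Hypothetical helper approximating the planner's check; not Spark's actual code."""
    if prefer_sort_merge_join:
        return False
    # Each partition of the build side should be small enough to hash in memory.
    can_build_local_map = (
        build_side_bytes < auto_broadcast_threshold * num_shuffle_partitions
    )
    # The build side should be much smaller than the other side (factor 3 assumed).
    much_smaller = build_side_bytes * 3 <= stream_side_bytes
    return can_build_local_map and much_smaller
```

Under these assumptions, shuffle hash join is only considered once the preference flag is turned off and the build side is both small per partition and much smaller than the other side.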

On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <[hidden email]> wrote:

In some cases, shuffle hash join performs even better than sort merge join.

However, I noticed that ShuffledHashJoinExec does not implement CodegenSupport. Is there a particular concern behind that? And is there any chance of improving the performance of ShuffledHashJoinExec?

 

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

Wang, Gang

That’s right. By default, Spark prefers sort merge join.

However, in our production environment, there are many huge bucketed tables. We can leverage the bucketing to avoid a shuffle when joining them with other, smaller tables (these tables are still not small enough for a broadcast join). The problem is that, although the shuffle can be avoided, a sort is still necessary for sort merge join (we cannot pre-sort the tables, since there are different join patterns). For a huge table, the sort may take tens of seconds.
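
To make the trade-off concrete, here is a minimal, Spark-independent Python sketch of joining a single bucket both ways. Rows are assumed to be `(key, value)` tuples and the function names are illustrative only; real executors work on binary rows and spill to disk, which this ignores.

```python
from collections import defaultdict

def hash_join(build_rows, stream_rows):
    """Join one bucket: hash the smaller side, then probe with the larger side."""
    table = defaultdict(list)
    for key, value in build_rows:          # one pass over the small side, no sorting
        table[key].append(value)
    return [(key, s_val, b_val)
            for key, s_val in stream_rows  # larger side is streamed once, unsorted
            for b_val in table.get(key, [])]

def sort_merge_join(left_rows, right_rows):
    """Join one bucket: sort both sides by key, then merge them."""
    left = sorted(left_rows, key=lambda r: r[0])    # the sort cost discussed above
    right = sorted(right_rows, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row of this key with that run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out
```

Both functions return the same set of joined rows (in different orders); the hash variant simply never sorts the huge side, at the price of holding the small side of each bucket in memory.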

That’s why I’m trying to enable shuffle hash join. In such cases, we saw roughly a 10% to 20% improvement when applying shuffle hash join instead of sort merge join. I wonder whether there is still room to improve shuffle hash join, for example through code generation for ShuffledHashJoinExec or something similar?
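
As a toy analogy for what code generation buys (this is not how a CodegenSupport implementation looks; Spark generates Java source and compiles it at runtime), the Python sketch below evaluates the same expression tree either by walking it for every row or by generating source once and compiling it into a plain function. The tuple-based tree format and all names are invented for illustration.

```python
# Toy expression tree: ("col", i), ("lit", v), ("add", l, r), ("gt", l, r)

def interpret(expr, row):
    """Evaluate the tree for one row by walking it recursively (no codegen)."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "lit":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if op == "add":
        return left + right
    if op == "gt":
        return left > right
    raise ValueError(f"unknown op: {op}")

def to_source(expr):
    """Translate the tree into a Python expression string."""
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]}]"
    if op == "lit":
        return repr(expr[1])
    symbol = {"add": "+", "gt": ">"}[op]
    return f"({to_source(expr[1])} {symbol} {to_source(expr[2])})"

def compile_expr(expr):
    """Generate source once, compile it, and return a plain function of a row."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))

# Example: the predicate (col0 + col1) > 10
predicate = ("gt", ("add", ("col", 0), ("col", 1)), ("lit", 10))
compiled = compile_expr(predicate)
```

The compiled function gives the same answers as `interpret`, but the per-row tree walk and operator dispatch are gone, which is the kind of overhead codegen for the join operator would aim to remove.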

 

From: Wenchen Fan <[hidden email]>
Date: Sunday, November 10, 2019 at 5:57 PM
To: "Wang, Gang" <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

 

By default, sort merge join is preferred over shuffle hash join; that's why we haven't spent resources on implementing codegen for it.

 

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

cloud0fan
Yeah, codegen can be a good improvement. PRs are welcome!

On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang <[hidden email]> wrote:

That’s right. By default, Spark prefers sort merge join.

However, in our production environment, there are many huge bucketed tables. We can leverage the bucketing to avoid a shuffle when joining them with other, smaller tables (these tables are still not small enough for a broadcast join). The problem is that, although the shuffle can be avoided, a sort is still necessary for sort merge join (we cannot pre-sort the tables, since there are different join patterns). For a huge table, the sort may take tens of seconds.

That’s why I’m trying to enable shuffle hash join. In such cases, we saw roughly a 10% to 20% improvement when applying shuffle hash join instead of sort merge join. I wonder whether there is still room to improve shuffle hash join, for example through code generation for ShuffledHashJoinExec or something similar?

 
