A question about broadcast join in spark2.2.0 and spark2.3.1

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

A question about broadcast join in spark2.2.0 and spark2.3.1

zhangliyun
Hi all:
  i want to ask a question about how to change common join to broadcast join.
  I have a query like in spark 2.2.0 and spark 2.3.1 seperately
```
select * from A left join B
on A.mth_id - 12 =  B.mth_id  and A.email = B.email.
```
 The value of spark.sql.autoBroadcastJoinThreshold is same ( spark.sql.autoBroadcastJoinThreshold=700000000 700M), but for some reason, it became 
broadcast join in 2.2.0 while  a sort merged join in spark 2.3.1.  It is strange that in spark 2.2.0 the broadcast data size is 0 bytes but actually i think the value is not 0.  The problem here this query should use broadcast join but i don't know why the broadcast data size is 0 in spark2.2 and why this  is not a broadcast join in 2.3.1. Is there any difference about the threshold of broadcast join in spark 2.2.0 and 2.3.1? Is the parameter "spark.sql.autoBroadcastJoinThreshold" only factor which decide the broadcast or sort merge join? Appreciate that if you can give me some suggestion.