Optimize (cache) when the same df is broadcast multiple times.
In our application, we join the same dataframe several times with different other dataframes, in separate queries and not always on the same join column.
This left-hand side df is not very large, so a broadcast hint may be beneficial.
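To make the setup concrete, here is a minimal sketch of the pattern. All dataframe and column names (`small`, `orders`, `events`, `id`, `code`) are made up for illustration and the data is tiny; the only real API calls are `broadcast` from `org.apache.spark.sql.functions`, `join`, `explain` and `show`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object RepeatedBroadcastJoin {
  def main(args: Array[String]): Unit = {
    // Local session, only for the purpose of this sketch.
    val spark = SparkSession.builder()
      .appName("repeated-broadcast-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical small dataframe that is reused across queries.
    val small = Seq((1, "a", "x"), (2, "b", "y")).toDF("id", "code", "label")

    // Hypothetical other dataframes, each joined in a separate query.
    val orders = Seq((1, 10.0), (2, 20.0), (1, 5.0)).toDF("id", "amount")
    val events = Seq(("a", "click"), ("b", "view")).toDF("code", "event")

    // Query 1: the small dataframe is on the left, hinted for broadcast, joined on "id".
    val q1 = broadcast(small).join(orders, Seq("id"))

    // Query 2: same small dataframe, same hint, but joined on a different column.
    val q2 = broadcast(small).join(events, Seq("code"))

    // Print the physical plans so the broadcast exchange in each query can be inspected.
    q1.explain()
    q2.explain()

    // Trigger both queries; the broadcasts can then be looked at in the Spark UI.
    q1.show()
    q2.show()

    spark.stop()
  }
}
```

Comparing the two plans printed by `explain()` and the corresponding jobs in the Spark UI is how I have been checking whether the broadcast of `small` is shared between the queries.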
My questions:
- If the same df gets broadcast multiple times, will the transfer occur only once (with the broadcast data somehow cached on the executors), or once per query?
- If the joins use different columns, will the broadcast still be reused, or does what is broadcast depend on the join key?
From what I see in the code, the debug logs and the stats in the Spark UI, it seems the data is broadcast again every time. Am I correct?