Optimize (cache) when the same df is broadcast multiple times.

matd
This post has NOT been accepted by the mailing list yet.
Hi folks,

In our application, we join the same dataframe with several other dataframes, in separate queries and not always on the same join column.

This left-hand side df is not very large, so a broadcast hint may be beneficial.
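
For concreteness, here is a minimal sketch of the pattern in Scala. The dataframe names (`small`, `orders`, `events`), their columns, and the sample rows are all hypothetical stand-ins, not our real schema; only `broadcast` from `org.apache.spark.sql.functions` and the standard `join` / `explain` calls are actual Spark API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object RepeatedBroadcastJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repeated-broadcast-join")
      .master("local[*]") // local mode, only for this illustration
      .getOrCreate()
    import spark.implicits._

    // Hypothetical small dataframe that is reused across several queries.
    val small = Seq((1, "a", "x"), (2, "b", "y")).toDF("id", "code", "label")

    // Hypothetical other dataframes, each joined on a different column.
    val orders = Seq((1, 10.0), (2, 20.0), (1, 5.0)).toDF("id", "amount")
    val events = Seq(("a", "click"), ("b", "view")).toDF("code", "kind")

    // Two separate queries, each hinting that `small` should be broadcast.
    val q1 = broadcast(small).join(orders, Seq("id"))
    val q2 = broadcast(small).join(events, Seq("code"))

    // Print the physical plans to check whether the hint was honored.
    q1.explain()
    q2.explain()

    q1.show()
    q2.show()

    spark.stop()
  }
}
```

Each of `q1` and `q2` is executed as its own query, which is the situation the questions below are about.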

My questions:
- If the same df gets broadcast multiple times, does the transfer occur only once (with the broadcast data somehow cached on the executors), or once per query?
- If the joins use different columns, is the broadcast data still reused, or does what is broadcast depend on the join key?

From what I see in the code, the debug logs, and the stats in the Spark UI, it seems the data is broadcast again every time. Am I correct?

Would this merit a JIRA?

Thanks
Mathieu