[SPARK SQL] Make max multi table join limit configurable in OptimizeSkewedJoin

Alfie Davidson
Hi All,

This is my first time contributing, so I am reaching out by email before creating a JIRA ticket and PR. I would like to propose a small enhancement to OptimizeSkewedJoin.

Currently, OptimizeSkewedJoin has a hardcoded limit for multi-table joins (limit = 2). For queries with more joins (n > 2), OptimizeSkewedJoin will only be applied to two of the n joins.

A code comment suggests the limit defaults to 2 because there are too many complex combinations to consider beyond that. However, it would be good to let users override this via Spark configuration, since the acceptable level of complexity is use-case dependent.

Proposal:
  1. Add spark.sql.adaptive.skewJoin.maxMultiTableJoin (default = 2) to SQLConf
  2. Update OptimizeSkewedJoin to respect the above configuration
  3. If the user sets a value > 2, log a warning to flag the added planning complexity
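For illustration, the new entry could be registered in SQLConf using the existing config-builder DSL. This is only a sketch of the proposal; the entry name comes from the list above, but the val name, doc string, and exact builder chain are my assumptions, not a final implementation:

```scala
// Hypothetical SQLConf entry (names and doc text are illustrative).
val SKEW_JOIN_MAX_MULTI_TABLE_JOIN =
  buildConf("spark.sql.adaptive.skewJoin.maxMultiTableJoin")
    .doc("The maximum number of joined tables OptimizeSkewedJoin will " +
      "consider when handling multi-table joins. Values above 2 may " +
      "increase planning complexity and will log a warning.")
    .intConf
    .checkValue(_ >= 2, "The join limit must be at least 2.")
    .createWithDefault(2)
```

A user would then opt in per job, e.g. `spark.conf.set("spark.sql.adaptive.skewJoin.maxMultiTableJoin", 3)`, while the default preserves today's behaviour.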

If people think this is a good and useful idea, please let me know and I will proceed.

Kind Regards,

Alfie