Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

gjohnson35
The join condition with && is throwing an exception:  

 val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=> baseDF("medical_claim_id")
      && mccDF("medical_claim_detail_id") <=> baseDF("medical_claim_detail_id"), "left")
      .join(revCdDF, revCdDF("revenue_code_padded_str") <=> mccDF("mcc_code"), "left")
      .select(baseDF("medical_claim_id"), baseDF("medical_claim_detail_id"), baseDF("revenue_code"), baseDF("rev_code_distinct_count"),
        baseDF("rtos_1_1_count"), baseDF("rtos_1_0_count"), baseDF("er_visit_flag"), baseDF("observation_stay_flag"),
        revCdDF("rtos_2_code"), revCdDF("rtos_2_hierarchy"))
      .where(revCdDF("rtos_2_code").between(8, 27).isNotNull)
      .groupBy(
        baseDF("medical_claim_id"),
        baseDF("medical_claim_detail_id")
      )
      .agg(min(revCdDF("rtos_2_code").alias("min_rtos_2_8_thru_27")), min(revCdDF("rtos_2_hierarchy").alias("min_rtos_2_8_thru_27_hier")))


This query runs fine:

val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=> baseDF("medical_claim_id"), "left")
        .join(mccDF, mccDF("medical_claim_detail_id") <=> baseDF("medical_claim_detail_id"), "left")
      .join(revCdDF, revCdDF("revenue_code_padded_str") <=> mccDF("mcc_code"), "left")
      .select(baseDF("medical_claim_id"), baseDF("medical_claim_detail_id"), baseDF("revenue_code"), baseDF("rev_code_distinct_count"),
        baseDF("rtos_1_1_count"), baseDF("rtos_1_0_count"), baseDF("er_visit_flag"), baseDF("observation_stay_flag"),
        revCdDF("rtos_2_code"), revCdDF("rtos_2_hierarchy"))
      .where(revCdDF("rtos_2_code").between(8, 27).isNotNull)
      .groupBy(
        baseDF("medical_claim_id"),
        baseDF("medical_claim_detail_id")
      )
      .agg(min(revCdDF("rtos_2_code").alias("min_rtos_2_8_thru_27")), min(revCdDF("rtos_2_hierarchy").alias("min_rtos_2_8_thru_27_hier")))

If I remove the multiple Columns in the join and create a join statement for each one then the exception goes away.  Is there a better way to join multiple columns?

Reply | Threaded
Open this post in threaded view
|

Re: Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

Andrew Ray
You probably don't want null safe equals (<=>) with a left join. 

On Mon, Apr 3, 2017 at 5:46 PM gjohnson35 <[hidden email]> wrote:
The join condition with && is throwing an exception:

 val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=>
baseDF("medical_claim_id")
      && mccDF("medical_claim_detail_id") <=>
baseDF("medical_claim_detail_id"), "left")
      .join(revCdDF, revCdDF("revenue_code_padded_str") <=>
mccDF("mcc_code"), "left")
      .select(baseDF("medical_claim_id"), baseDF("medical_claim_detail_id"),
baseDF("revenue_code"), baseDF("rev_code_distinct_count"),
        baseDF("rtos_1_1_count"), baseDF("rtos_1_0_count"),
baseDF("er_visit_flag"), baseDF("observation_stay_flag"),
        revCdDF("rtos_2_code"), revCdDF("rtos_2_hierarchy"))
      .where(revCdDF("rtos_2_code").between(8, 27).isNotNull)
      .groupBy(
        baseDF("medical_claim_id"),
        baseDF("medical_claim_detail_id")
      )
      .agg(min(revCdDF("rtos_2_code").alias("min_rtos_2_8_thru_27")),
min(revCdDF("rtos_2_hierarchy").alias("min_rtos_2_8_thru_27_hier")))


This query runs fine:

val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=>
baseDF("medical_claim_id"), "left")
        .join(mccDF, mccDF("medical_claim_detail_id") <=>
baseDF("medical_claim_detail_id"), "left")
      .join(revCdDF, revCdDF("revenue_code_padded_str") <=>
mccDF("mcc_code"), "left")
      .select(baseDF("medical_claim_id"), baseDF("medical_claim_detail_id"),
baseDF("revenue_code"), baseDF("rev_code_distinct_count"),
        baseDF("rtos_1_1_count"), baseDF("rtos_1_0_count"),
baseDF("er_visit_flag"), baseDF("observation_stay_flag"),
        revCdDF("rtos_2_code"), revCdDF("rtos_2_hierarchy"))
      .where(revCdDF("rtos_2_code").between(8, 27).isNotNull)
      .groupBy(
        baseDF("medical_claim_id"),
        baseDF("medical_claim_detail_id")
      )
      .agg(min(revCdDF("rtos_2_code").alias("min_rtos_2_8_thru_27")),
min(revCdDF("rtos_2_hierarchy").alias("min_rtos_2_8_thru_27_hier")))

If I remove the multiple Columns in the join and create a join statement for
each one then the exception goes away.  Is there a better way to join
multiple columns?





--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-left-join-with-multiple-columns-Join-condition-is-missing-or-trivial-Use-the-CROSS-JOIN-syntax-tp21297.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

gjohnson35
Thanks Andrew.  I completely missed that. It worked by removing the null safe join condition.