BroadcastJoin failed on partitioned parquet table


BroadcastJoin failed on partitioned parquet table

白也诗无敌
Hi, guys:
     I'm using Spark 1.6.2.
     There are two tables, and the smaller one is a partitioned parquet table.
     The total size of the small table is about 1000 MB, but each partition is only about 1 MB.
     When I set spark.sql.autoBroadcastJoinThreshold to 50 MB and join the two tables on a single partition of the small table, I get a SortMergeJoin physical plan instead of a broadcast join.
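
     For reference, this is roughly how I reproduce it (the table, column, and partition names are made up for illustration):

      // Spark 1.6; sqlContext is a HiveContext
      // set the broadcast threshold to 50 MB (the config takes bytes)
      sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
        (50 * 1024 * 1024).toString)

      // join the big table against a single partition of the small table
      val df = sqlContext.sql(
        """SELECT b.*, s.value
          |FROM big_table b
          |JOIN small_table s ON b.key = s.key
          |WHERE s.dt = '2016-09-01'""".stripMargin)

      // prints logical and physical plans; I expect BroadcastHashJoin
      // here but get SortMergeJoin
      df.explain(true)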
     I have done some investigation, and it seems to be related to partition pruning:
     1. Checking the physical plan shows that all partitions of the small table are included, not just the one being joined.
      This looks like https://issues.apache.org/jira/browse/SPARK-16980

     2. Setting spark.sql.hive.convertMetastoreParquet=false makes the pruning succeed, but I still get SortMergeJoin, because of this code in HiveMetastoreCatalog.scala:

      @transient override lazy val statistics: Statistics = Statistics(
        sizeInBytes = {
          val totalSize = hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE)
          val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
          // ...
        }
      )

      Here sizeInBytes comes from the total size of the table, not the size of the single partition, so it never falls below the broadcast threshold (see the snippet below for how I check the estimated size).
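
     This is how I check the size estimate the planner actually uses (the table name and partition filter are illustrative):

      // Spark 1.6: print the size estimate used for broadcast-join selection.
      // Even with the single-partition filter, it reports the full table size.
      val small = sqlContext.table("small_table").where("dt = '2016-09-01'")
      println(small.queryExecution.analyzed.statistics.sizeInBytes)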

    How can I fix this without patching Spark? Or is there a patch for SPARK-16980 that applies to Spark 1.6?
  

best regards!
Jerry
Re: BroadcastJoin failed on partitioned parquet table

白也诗无敌
Besides, I have tried the ANALYZE statement. It does not help, because I need the size of the single partition, but Spark reads the size of the whole table from the Hive table parameters such as 'totalSize' and 'rawDataSize'.
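
For the record, this is the kind of statement I ran (the table name and partition spec are illustrative; as far as I can tell, Spark 1.6 passes this form of ANALYZE through to Hive):

      // computes Hive statistics for one partition, but the planner in
      // Spark 1.6 still reads the table-level 'totalSize' parameter
      sqlContext.sql(
        "ANALYZE TABLE small_table PARTITION (dt='2016-09-01') COMPUTE STATISTICS")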



