Please help: question about repartition for a dataset from a partitioned Hive table


zhangliyun


Hi All:
   I have a question about the repartition API and Spark SQL partitioning. I have a table whose partition key is day:
```
./bin/spark-sql -e "CREATE TABLE t_original_partitioned_spark (cust_id int, loss double) PARTITIONED BY (day STRING) location 'hdfs://localhost:9000/t_original_partitioned_spark'"

```
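
A quick way to confirm the partition column took effect (a minimal sketch, assuming a Hive-enabled sqlContext as in the snippets below):
```
sqlContext.sql("DESCRIBE FORMATTED t_original_partitioned_spark").show(100, false)
// day should be listed under "# Partition Information"
```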
I insert several rows, so there are now two Hive partitions, one per day (2019-05-30 and 2019-05-20):
```
sqlContext.sql("insert into  t_original_partitioned_spark values (30,'0.3','2019-05-30'))
sqlContext.sql("insert into  t_original_partitioned_spark values (20,'0.2','2019-05-20'))


```
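
For what it's worth, these inserts use dynamic partitioning (the day value comes from the VALUES clause rather than an explicit PARTITION spec). If they fail with a dynamic-partition error, the usual fix is below; whether it is needed depends on the Hive/Spark configuration, so treat this as a sketch:
```
// Only needed if Hive rejects the dynamic-partition insert (configuration-dependent):
sqlContext.sql("SET hive.exec.dynamic.partition=true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Alternative with a static partition spec, which sidesteps dynamic partitioning:
sqlContext.sql("INSERT INTO t_original_partitioned_spark PARTITION (day='2019-05-30') VALUES (30, 0.3)")
```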

Now I want to repartition the data down to one partition: in the real use case there may be too many partitions, and I want fewer.

I call the repartition API and overwrite the table. I expected there to be one partition afterwards, but "show partitions default.t_original_partitioned_spark" still reports two:
```

val df = sqlContext.sql("select * from t_original_partitioned_spark")
val df1 = df.repartition(1)
// insertInto writes using the table's existing storage format, so no format() call is needed
df1.write.mode(org.apache.spark.sql.SaveMode.Overwrite).insertInto("default.t_original_partitioned_spark")

```
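
To illustrate the two different notions of "partition" at play (a minimal sketch continuing from the snippet above; as far as I understand, repartition only changes the DataFrame/RDD partitioning, not the table's partition layout):
```
// repartition(1) controls the number of DataFrame/RDD partitions:
println(df1.rdd.getNumPartitions)  // 1

// Hive table partitions are a separate concept: one per distinct value of the
// partition column day, whatever the RDD partitioning of the writer was.
// This should still list day=2019-05-20 and day=2019-05-30:
sqlContext.sql("SHOW PARTITIONS default.t_original_partitioned_spark").show()
```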

My question is: when both are in play, is the resulting partition count decided by the argument to repartition($num) or by the Hive table's partitions?
Best Regards
Kelly Zhang