Static partitioning in partitionBy()

Static partitioning in partitionBy()

Shubham Chaurasia
Hi All,

Is there a way I can provide static partitions in partitionBy()?

Like:
df.write.mode("overwrite").format("MyDataSource").partitionBy("c=c1").save

The above code gives the following error, since Spark tries to find a column named `c=c1` in df:

org.apache.spark.sql.AnalysisException: Partition column `c=c1` not found in schema struct<a:string,b:string,c:string>;

Thanks,
Shubham
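
For context: partitionBy() expects bare column names, not key=value specs; the c=<value> directory names are derived from the column's values at write time. A minimal Scala sketch of the standard dynamic usage, with a placeholder app name and output path, assuming a DataFrame matching the schema in the error:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitionBy-sketch")
  .master("local[*]")   // for local testing only
  .getOrCreate()
import spark.implicits._

val df = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2")).toDF("a", "b", "c")

// Pass the column name; Spark creates one subdirectory per distinct
// value of c under the target path (.../c=c1, .../c=c2, ...).
df.write.mode("overwrite").partitionBy("c").parquet("/tmp/example_table")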
Re: Static partitioning in partitionBy()

Shubham Chaurasia
Thanks

On Wed, May 8, 2019 at 10:36 AM Felix Cheung <[hidden email]> wrote:
You could

df.filter(col("c") === "c1").write.partitionBy("c").save()

It could hit some data skew problems, but it might work for you.
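
Spelled out as a self-contained Scala sketch of that suggestion (the app name and output path are placeholders, and the dynamic partition-overwrite setting assumes Spark 2.3+; without it, plain mode("overwrite") replaces the whole table path rather than just c=c1):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("static-partition-sketch")
  .master("local[*]")   // for local testing only
  .getOrCreate()
import spark.implicits._

val df = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2")).toDF("a", "b", "c")

// Only overwrite partitions present in the written data (Spark 2.3+);
// without this, mode("overwrite") wipes the entire table path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.filter(col("c") === "c1")   // keep only the rows for the static partition
  .write
  .mode("overwrite")
  .partitionBy("c")            // still produces a single .../c=c1 directory
  .parquet("/tmp/example_table")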




From: Burak Yavuz <[hidden email]>
Sent: Tuesday, May 7, 2019 9:35:10 AM
To: Shubham Chaurasia
Cc: dev; [hidden email]
Subject: Re: Static partitioning in partitionBy()
 
It depends on the data source. Delta Lake (https://delta.io) lets you do it with .option("replaceWhere", "c = 'c1'"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity.
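
A hedged sketch of both variants, reusing df and spark from the sketches above (the Delta Lake dependency, the table paths, and the string type of c are assumptions):

import org.apache.spark.sql.functions.col

// Delta Lake: atomically replace only the rows matching the predicate.
// Assumes the io.delta delta-core artifact is on the classpath. Note
// the quoting: c is a string column, so the literal needs quotes.
df.filter(col("c") === "c1")
  .write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "c = 'c1'")
  .partitionBy("c")
  .save("/tmp/delta_table")

// Plain file formats: write the filtered rows straight into the
// partition directory. Drop c first, since the directory name already
// encodes its value; this path offers no atomicity guarantees.
df.filter(col("c") === "c1")
  .drop("c")
  .write
  .mode("overwrite")
  .parquet("/tmp/example_table/c=c1")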
