Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior?

Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior?

Yash Sharma
Hi All,
While writing a partitioned DataFrame as partitioned text files, I see that Spark deletes all existing partitions even when writing only a few new ones.

dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/")

Is this the expected behavior?

I have a correction job that overwrites a couple of past partitions based on newly arriving data. I would only want to replace those partitions.

Is there a neater way to do that other than the following (rough sketch after this list):
- Find the partitions to be replaced
- Delete them using the Hadoop APIs
- Write the DataFrame in Append mode
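
For reference, here is a sketch of that workaround (assuming spark and dataDF are already in scope; the path and partition columns are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// 1. Find the partitions present in the incoming data
val partitions = dataDF.select("year", "month", "date").distinct().collect()

// 2. Delete those partition directories via the Hadoop FileSystem API
val fs = FileSystem.get(new java.net.URI("s3://data/test2/events/"),
                        spark.sparkContext.hadoopConfiguration)
partitions.foreach { row =>
  val dir = new Path(s"s3://data/test2/events/" +
    s"year=${row.get(0)}/month=${row.get(1)}/date=${row.get(2)}")
  if (fs.exists(dir)) fs.delete(dir, true)  // recursive delete
}

// 3. Write the new data in Append mode
dataDF.write.partitionBy("year", "month", "date")
  .mode(SaveMode.Append).text("s3://data/test2/events/")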


Cheers
Yash



Re: Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior?

nirandap
Hi Yash, 

Yes, AFAIK, that is the expected behavior of the Overwrite mode. 

I think you can use an approach like the following if you want to overwrite individual partitions.
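
For example, writing directly to a partition's directory replaces just that partition (a sketch; dataDF and the partition values are illustrative):

import org.apache.spark.sql.SaveMode
import spark.implicits._  // for the $ column syntax

// Overwrite a single partition by targeting its directory directly.
// The partition columns are dropped because they are already encoded
// in the path; the text writer then expects a single string column.
dataDF
  .where($"year" === 2016 && $"month" === 7 && $"date" === 7)
  .drop("year", "month", "date")
  .write
  .mode(SaveMode.Overwrite)
  .text("s3://data/test2/events/year=2016/month=7/date=7/")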

Best

Re: Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior?

golokeshpatra.patra
In reply to this post by Yash Sharma
Adding this simple setting helped me overcome the issue -

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
My issue -

In an S3 folder, I previously had data partitioned by ingestiontime.
Now I wanted to reprocess this data and partition it by
businessname & ingestiontime.

Whenever I wrote my DataFrame in Overwrite mode,
all of the data present prior to the operation was
truncated/deleted.

After setting the above Spark configuration,
only the required partitions are truncated and overwritten, and all
others stay intact.
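
A minimal end-to-end sketch of this (requires Spark 2.3+; the path and values are illustrative):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("dynamic-overwrite").getOrCreate()
import spark.implicits._

// "dynamic" replaces only the partitions present in the incoming
// DataFrame instead of truncating the whole output path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val updates = Seq(("acme", "2020-01-01", "event-a"))
  .toDF("businessname", "ingestiontime", "payload")

updates.write
  .partitionBy("businessname", "ingestiontime")
  .mode(SaveMode.Overwrite)
  .parquet("s3://data/test2/events/")  // other partitions stay intact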

In addition to this, if you have the Hadoop trash enabled, you might be
able to recover the lost data. For more, see:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#File_Deletes_and_Undeletes


