spark lacks fault tolerance with dynamic partition overwrite
i wanted to highlight an issue we are facing with dynamic partition overwrite (spark.sql.sources.partitionOverwriteMode=dynamic).
it seems that any task that writes to disk using this feature and needs to be retried fails consistently on retry, which in turn fails the entire job.
we have seen this issue show up with preemption (the task gets killed by preemption, and when it is rescheduled it fails consistently). it can also show up if a hardware issue causes a task to fail, or if you have speculative execution enabled (the speculative copy of the task hits the same failure).
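for context, this is the kind of write that triggers it — a minimal sketch assuming an existing SparkSession (`spark`), a DataFrame (`df`), and an illustrative partition column and output path:

```python
# enable dynamic partition overwrite: only the partitions present in the
# incoming DataFrame are replaced, instead of truncating the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .mode("overwrite")
   .partitionBy("date")          # illustrative partition column
   .parquet("/path/to/table"))   # illustrative output path
```

any retried task belonging to this write (preempted, failed on bad hardware, or launched speculatively) is where we see the consistent failure.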