Structured Streaming with S3 file source duplicates data because of eventual consistency


Yash Sharma
Hi Team,
I have been using Structured Streaming with the S3 file source, but I am seeing it duplicate data intermittently. A new run seems to fix it, but the duplication happens in roughly 10% of runs, and the rate increases with the number of files in the source. Investigating further, I see this is clearly an issue with S3's eventual consistency: Spark processes the task twice because it is not able to verify that the completed task successfully wrote its output.
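To make the suspected failure mode concrete, here is a minimal toy sketch in plain Python. It is not Spark's actual commit protocol and does not call any Spark or S3 API; the names `LaggyStore`, `run_task`, and `stale_reads` are hypothetical and exist only for illustration. It assumes a store where a freshly written object is not yet visible to an existence check, and a writer that retries when it cannot see its own output.

```python
class LaggyStore:
    """Toy object store (hypothetical): a new key is reported as missing
    for the first `stale_reads` visibility checks after it is written."""

    def __init__(self, stale_reads):
        self.stale_reads = stale_reads
        self._data = {}      # key -> list of rows
        self._pending = {}   # key -> remaining stale visibility checks

    def put(self, key, rows):
        self._data[key] = list(rows)
        self._pending[key] = self.stale_reads

    def visible(self, key):
        # Simulates an eventually consistent existence check.
        if key not in self._data:
            return False
        if self._pending[key] > 0:
            self._pending[key] -= 1
            return False
        return True

    def read_all(self, prefix):
        # What a downstream reader sees once the store has converged.
        rows = []
        for key in sorted(self._data):
            if key.startswith(prefix):
                rows.extend(self._data[key])
        return rows


def run_task(store, task_id, rows, max_attempts=2):
    """Write the task output, then retry if the output cannot be verified.
    Each attempt writes under a new key, so a retry adds a second copy."""
    for attempt in range(max_attempts):
        key = f"out/part-{task_id}-attempt-{attempt}"
        store.put(key, rows)
        if store.visible(key):
            return attempt + 1   # number of attempts that wrote output
    return max_attempts
```

In this model, `run_task(LaggyStore(stale_reads=0), 0, [1, 2, 3])` writes once, while the same call against `LaggyStore(stale_reads=1)` writes twice and `read_all("out/")` returns every row two times, which matches the kind of duplication I am observing.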

I have added all the details of the investigation, with code and error logs, in the ticket below. Is there a way we can address this issue, and is there anything I can help out with?


Cheers