Re: dropDuplicates and watermark in structured streaming
Why do you have two watermarks? Once you apply the watermark to a column (e.g., "time"), it can be used in all later operations as long as the column is preserved. So the above code should be equivalent to
The right way to think about the watermark threshold is "how late and out of order can my data be". The answer may be completely different from the window size. You may want to compute 10-minute windows while your data arrives up to 5 hours late. In that case you should define the watermark as 5 hours, not 10 minutes.
On a side note, you can use "approx_count_distinct" if you are okay with some approximation.
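For readers following along, here is a minimal PySpark sketch of the two suggestions above: a single watermark that reflects data lateness, and approx_count_distinct as an alternative to deduplicating first. It is not the code from the original message. The column names "id" and "time", the 5-hour and 10-minute values, and both function names are assumptions for illustration, and "events" is assumed to be a streaming DataFrame. Whether deduplication state is actually cleaned up when the key includes the derived window column is worth verifying on your Spark version.

```python
def exact_distinct_per_window(events):
    """Count distinct ids per 10-minute window by deduplicating first.

    `events` is assumed to be a streaming DataFrame with hypothetical
    columns "id" and "time" (an event-time timestamp).
    """
    # Imported lazily so this file loads even where PySpark is not installed.
    from pyspark.sql import functions as F

    return (
        events
        # One watermark only: how late and out of order the data can be.
        .withWatermark("time", "5 hours")
        # Derive the window from the watermarked column.
        .withColumn("window", F.window("time", "10 minutes"))
        # Keep one row per (id, window) pair.
        .dropDuplicates(["id", "window"])
        # Each remaining row is one distinct id in its window.
        .groupBy("window")
        .count()
    )


def approx_distinct_per_window(events):
    """Approximate distinct ids per window, with no separate dedup step."""
    from pyspark.sql import functions as F

    return (
        events
        .withWatermark("time", "5 hours")
        .groupBy(F.window("time", "10 minutes"))
        .agg(F.approx_count_distinct("id").alias("approx_distinct_ids"))
    )
```

Either function returns a DataFrame that still has to be started with writeStream. The second variant keeps only one stateful operator in the query, which is usually simpler to reason about when an approximate count is acceptable.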
I'm new to Structured Streaming. Because the built-in API cannot perform a windowed count distinct, I want to use dropDuplicates first and then perform the window count.
But in doing so I ran into two problems:
1. Because this is a streaming computation, the deduplication state needs to be cleared in time, which requires a watermark. Assuming my event-time field is consistently increasing and I set the watermark to 1 hour, does that mean the data at 10 o'clock will only be compared against the data from 9 o'clock to 10 o'clock, and the data from before 9 o'clock will be cleared?
2. Because the deduplication is per window, I set the watermark before deduplication to the window size. But after deduplication, I need to call withWatermark() again to set the watermark to the real watermark. Will setting the watermark again take effect?
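To make the first question concrete, here is a small pure-Python model of watermark-driven deduplication state. It is a simplified illustration, not Spark's implementation: it advances the watermark after every record, whereas Spark advances it between micro-batches, and the function name and the sample events are invented for the example.

```python
from datetime import datetime, timedelta


def dedup_with_watermark(events, delay):
    """Return first occurrences of each id, forgetting old state.

    events: iterable of (event_time, id) pairs in arrival order.
    delay:  watermark threshold, i.e. how late data is allowed to be.
    """
    seen = {}        # id -> event time of the record that was kept
    max_time = None  # largest event time observed so far
    kept = []
    for event_time, key in events:
        if max_time is None or event_time > max_time:
            max_time = event_time
        watermark = max_time - delay
        # Evict state older than the watermark.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if event_time < watermark:
            continue  # too late: dropped before deduplication
        if key in seen:
            continue  # duplicate within the retained state
        seen[key] = event_time
        kept.append((event_time, key))
    return kept


def at(hour, minute=0):
    """Helper: a timestamp on an arbitrary fixed day."""
    return datetime(2020, 1, 1, hour, minute)


events = [
    (at(8, 30), "a"),  # kept
    (at(8, 45), "a"),  # duplicate of the 8:30 record
    (at(10), "b"),     # kept; watermark moves to 9:00, state for "a" is evicted
    (at(10), "a"),     # kept again, because the earlier "a" was forgotten
    (at(8), "c"),      # older than the 9:00 watermark, dropped as late
]
result = dedup_with_watermark(events, timedelta(hours=1))
print([key for _, key in result])  # ['a', 'b', 'a']
```

In this model, a record at 10 o'clock with a 1-hour threshold is only compared against state from 9 o'clock onward, which matches the intuition in question 1.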