[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

zero323

Hi,

Can I ask for some clarifications regarding intended behavior of window / TimeWindow?

PySpark documentation states that "Windows in the order of months are not supported". This is further confirmed by the checks in TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
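For reference, the rejecting branch looks roughly like this (paraphrased from the linked source, so a sketch rather than a verbatim copy):

private def getIntervalInMicroSeconds(interval: String): Long = {
  // parse through o.a.s.unsafe.types.CalendarInterval, e.g. "1 month" -> months = 1
  val cal = CalendarInterval.fromString("interval " + interval)
  // any non-zero month component is rejected outright
  if (cal.months > 0) {
    throw new IllegalArgumentException(
      s"Intervals greater than a month is not supported ($interval).")
  }
  cal.microseconds
}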

Surprisingly enough, we can pass an interval much larger than a month by expressing it in days or another unit of higher precision. So this fails:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))

while the following is accepted:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))

with results that look sensible at first glance.
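For completeness, a minimal reproduction in spark-shell (assuming spark.implicits._ is in scope):

import org.apache.spark.sql.functions.window

// rejected: the interval string carries a month component
Seq("2017-01-01").toDF("date")
  .groupBy(window($"date", "1 month"))
  .count()

// accepted, despite being far longer than a month
Seq("2017-01-01").toDF("date")
  .groupBy(window($"date", "999 days"))
  .count()
  .show(false)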

Is this a matter of faulty validation logic (months are assigned only if the interval string matches against years or months: https://git.io/vMPdi), or is it expected behavior and I simply misunderstood the intentions?

-- 
Best,
Maciej

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

Burak Yavuz
Hi Maciej,

I believe it would be useful to either fix the documentation or fix the implementation; I'll leave it to the community to comment on which. The code right now disallows intervals provided in months and years, because they are not a "consistently" fixed amount of time. A month can be 28, 29, 30, or 31 days. A year is 12 months for sure, but is it 360 days (sometimes used in finance), 365 days, or 366 days?

Therefore we could either:
  1) Allow windowing when intervals are given in days or smaller units, even though such an interval could still be as long as 365 days, and fix the documentation.
  2) Explicitly disallow long intervals, since there may be a lot of data for a given window (though partial aggregations should help with that).

My thoughts are to go with 1. What do you think?
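Concretely, under option 1 the caller would pick a fixed meaning for "a year" and encode it in days themselves, e.g. (a sketch; df and its timestamp column ts are placeholders, and 365 days is an arbitrary choice):

// the caller decides what "a year" means and spells it out in fixed units
df.groupBy(window($"ts", "365 days")).count()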

Best,
Burak



Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

zero323
Thanks for the response, Burak,

As any sane person, I try to steer away from objects that have both "calendar" and "unsafe" in their fully qualified names, but if there is no bigger picture I missed here, I would go with 1 as well, and of course fix the error message. I understand this has been introduced with structured streaming in mind, but it is a useful feature in general, not only at high-precision scales. To be honest, I would love to see some generalized version that could be used (I mean without hacking) with an arbitrary numeric sequence. It could address at least some scenarios in which people try to use window functions without a PARTITION BY clause and fail miserably.
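Something along these lines, say (a purely hypothetical sketch built from plain column arithmetic, not an existing API; numericWindow, df and x are made up):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{floor, struct}

// bucket an arbitrary numeric column into fixed-width, non-overlapping
// "windows" of the form [floor(x / width) * width, start + width)
def numericWindow(c: Column, width: Double): Column = {
  val start = floor(c / width) * width
  struct(start.as("start"), (start + width).as("end"))
}

df.groupBy(numericWindow($"x", 10.0).as("window")).count()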

Regarding ambiguity... sticking with days doesn't really resolve the problem, does it? If one were to nitpick, it doesn't look like this implementation even touches the subtleties of DST or leap seconds.
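For example (a sketch; the printed boundaries depend on the session timezone):

// windows are fixed-length spans measured from an epoch-based origin, so a
// "1 day" window is always 24 hours, even on days when a DST shift makes
// the local calendar day 23 or 25 hours long
Seq("2017-03-26 12:00:00").toDF("ts")
  .select(window($"ts".cast("timestamp"), "1 day"))
  .show(false)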







Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

Michael Armbrust
+1, we should just fix the error message to explain why months aren't allowed and suggest that you manually specify some number of days instead.
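For instance, something along these lines in TimeWindow (just a sketch of the wording, not a patch):

if (cal.months > 0) {
  throw new IllegalArgumentException(
    s"Intervals in units of months or years are not supported ($interval), " +
      "because months vary in length. Specify the window as a fixed number " +
      "of days instead, e.g. '30 days' rather than '1 month'.")
}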



