Allow average out of a Date

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Allow average out of a Date

Driesprong, Fokko
Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there is a list of types that allowed to be included in the summary, and I love to extend that list.

The first one is the date type. It is important to define this together with the community, and that we get consensus, as this will be part of the public API. Changing this will be costly (or maybe impossible) to do.

I've checked what other DBMS'es do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)
However, converting to an int, isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
Bigquery

Unsupported:

image.png

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as it is being used as a physical format internally in many data formats.

image.png

For me, returning a Date as the output of avg(date) seems like a logical choice. Internally it is handled as dates since epoch, which makes sense as well:


Any further thoughts on this before moving forward?

Kind regards, Fokko


Reply | Threaded
Open this post in threaded view
|

Re: Allow average out of a Date

gregrahn
Fokko-

I can certainly understand the ask for this behavior and it seems simple and/or straightforward at face value -- as R's Date Class behaves this way, however, I think there are several challenges involved here, as AFAIK, there seems to be no well defined way SQL systems do this implicitly today -- and I think there is a clear explanation of why.

As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

If we look back at the use case from SPARK-10520 which is about R's summary function on a vector of dates -- (and I don't claim to be an R internals expert) but maybe it makes more sense to implement R's Date Class behavior in SparkR vs more generically across multiple languages and types.  If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.


Screen Shot 2020-08-21 at 12.00.01 PM.png


For reference:




On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <[hidden email]> wrote:
Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there is a list of types that allowed to be included in the summary, and I love to extend that list.

The first one is the date type. It is important to define this together with the community, and that we get consensus, as this will be part of the public API. Changing this will be costly (or maybe impossible) to do.

I've checked what other DBMS'es do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)
However, converting to an int, isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
Bigquery

Unsupported:

image.png

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as it is being used as a physical format internally in many data formats.

image.png

For me, returning a Date as the output of avg(date) seems like a logical choice. Internally it is handled as dates since epoch, which makes sense as well:


Any further thoughts on this before moving forward?

Kind regards, Fokko


Reply | Threaded
Open this post in threaded view
|

Re: Allow average out of a Date

Driesprong, Fokko
Thanks, Greg for your elaborate response, and sharing the table.

The open PR is targetting SPARK-10520: https://github.com/apache/spark/pull/28554 Before opening a new ticket, I noticed that this was already an open issue. However, this will make it useable for Python/Java/Scala as well.

Not being able to sum dates is still an open issue. For the current solution I've taken inspiration from the timestamp implementation in Spark:

CREATE VIEW timestamps
    AS SELECT explode(array(
                  CAST('2020-07-01 12:00:00' AS timestamp),
                  CAST('2020-07-02 18:00:00' AS timestamp)
              )) AS a;

SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;

Sum:         3.1873032E9
Sum / Count: 1.5936516E9
Avg:         1.5936516E9

Using Spark 2.4.5. The datatype is implicitly converted a DoubleType (seconds since epoch), while I believe this should be an explicit choice. I've started the discussion in a separate PR.

I fully agree that summing/subtracting dates/timestamps don't look very sensible. However, taking an average can be of use. For example, if you're sessionising user activity or log entries, you might want to have a single date/timestamp to represent the session. Taking an average (or a median?) would make sense in such a case. 

> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

I would not bother the end-user with all these questions, as I would expect in most cases this isn't an issue. If the user wants to have full control, it can first cast the date to a double, which gives days since epoch. Then it can have full control over possible rounding etc. Next step would be taking the std-dev out of a Date column, I would suggest using an interval type to represent the deviation, but let's take one step at the time.

> It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

For now, I've decided to stick with the days since Unix epoch, as it seems to be popular among file formats:


> If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.

I see your point, and I don't like to reinvent the standards. But the ANSI standard is clearly too limited, as you might also want to do things that you normally wouldn't do in a database (for example, when it comes to feature generation or the abovementioned sessionising). While looking at other RDBM'ses, it is clear that there isn't a uniform way of doing this, as shown in my initial mail. For example, you won't be able to sum the cast output (YYYYMMDD) format of MySQL, because you might get invalid months/days in the output (oktober+oktober=month 20?).

Currently, it isn't possible to sum dates, if this is something that you want to have in there, I can add it to the PR. For the .summary() function it isn't required. The workaround is to cast it to a double first and then sum the days since epoch.

Kind regards, Fokko


Op vr 21 aug. 2020 om 18:13 schreef Greg Rahn <[hidden email]>:
Fokko-

I can certainly understand the ask for this behavior and it seems simple and/or straightforward at face value -- as R's Date Class behaves this way, however, I think there are several challenges involved here, as AFAIK, there seems to be no well defined way SQL systems do this implicitly today -- and I think there is a clear explanation of why.

As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

If we look back at the use case from SPARK-10520 which is about R's summary function on a vector of dates -- (and I don't claim to be an R internals expert) but maybe it makes more sense to implement R's Date Class behavior in SparkR vs more generically across multiple languages and types.  If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.


Screen Shot 2020-08-21 at 12.00.01 PM.png


For reference:




On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <[hidden email]> wrote:
Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there is a list of types that allowed to be included in the summary, and I love to extend that list.

The first one is the date type. It is important to define this together with the community, and that we get consensus, as this will be part of the public API. Changing this will be costly (or maybe impossible) to do.

I've checked what other DBMS'es do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)
However, converting to an int, isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
Bigquery

Unsupported:

image.png

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as it is being used as a physical format internally in many data formats.

image.png

For me, returning a Date as the output of avg(date) seems like a logical choice. Internally it is handled as dates since epoch, which makes sense as well:


Any further thoughts on this before moving forward?

Kind regards, Fokko


Reply | Threaded
Open this post in threaded view
|

Re: Allow average out of a Date

gregrahn
I think your "working" timestamp example exposes the crux of the implicit casting problem -- and PR#28554 points to one of the other prominent issues -- function return types.  The current implementation seems to be problematic to me because there should not be implicit conversions of timestamp types to double because there is no single definition of how to represent a timestamp in double format.  It's fine to have explicit casts or helper functions -- because then the user is consciously choosing the "units" being used for a given numerical representation.

If the goal here is for the summary to work on dates, can't we just teach that function how to do this using existing functionality w/o exposing all this internal numeric representation for dates? 

Just to be clear -- I have no objection for the objective here -- I just have concerns about the proposed implementation and the casting impacts.

On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <[hidden email]> wrote:
Thanks, Greg for your elaborate response, and sharing the table.

The open PR is targetting SPARK-10520: https://github.com/apache/spark/pull/28554 Before opening a new ticket, I noticed that this was already an open issue. However, this will make it useable for Python/Java/Scala as well.

Not being able to sum dates is still an open issue. For the current solution I've taken inspiration from the timestamp implementation in Spark:

CREATE VIEW timestamps
    AS SELECT explode(array(
                  CAST('2020-07-01 12:00:00' AS timestamp),
                  CAST('2020-07-02 18:00:00' AS timestamp)
              )) AS a;

SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;

Sum:         3.1873032E9
Sum / Count: 1.5936516E9
Avg:         1.5936516E9

Using Spark 2.4.5. The datatype is implicitly converted a DoubleType (seconds since epoch), while I believe this should be an explicit choice. I've started the discussion in a separate PR.

I fully agree that summing/subtracting dates/timestamps don't look very sensible. However, taking an average can be of use. For example, if you're sessionising user activity or log entries, you might want to have a single date/timestamp to represent the session. Taking an average (or a median?) would make sense in such a case. 

> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

I would not bother the end-user with all these questions, as I would expect in most cases this isn't an issue. If the user wants to have full control, it can first cast the date to a double, which gives days since epoch. Then it can have full control over possible rounding etc. Next step would be taking the std-dev out of a Date column, I would suggest using an interval type to represent the deviation, but let's take one step at the time.

> It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

For now, I've decided to stick with the days since Unix epoch, as it seems to be popular among file formats:


> If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.

I see your point, and I don't like to reinvent the standards. But the ANSI standard is clearly too limited, as you might also want to do things that you normally wouldn't do in a database (for example, when it comes to feature generation or the abovementioned sessionising). While looking at other RDBM'ses, it is clear that there isn't a uniform way of doing this, as shown in my initial mail. For example, you won't be able to sum the cast output (YYYYMMDD) format of MySQL, because you might get invalid months/days in the output (oktober+oktober=month 20?).

Currently, it isn't possible to sum dates, if this is something that you want to have in there, I can add it to the PR. For the .summary() function it isn't required. The workaround is to cast it to a double first and then sum the days since epoch.

Kind regards, Fokko


Op vr 21 aug. 2020 om 18:13 schreef Greg Rahn <[hidden email]>:
Fokko-

I can certainly understand the ask for this behavior and it seems simple and/or straightforward at face value -- as R's Date Class behaves this way, however, I think there are several challenges involved here, as AFAIK, there seems to be no well defined way SQL systems do this implicitly today -- and I think there is a clear explanation of why.

As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

If we look back at the use case from SPARK-10520 which is about R's summary function on a vector of dates -- (and I don't claim to be an R internals expert) but maybe it makes more sense to implement R's Date Class behavior in SparkR vs more generically across multiple languages and types.  If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.


Screen Shot 2020-08-21 at 12.00.01 PM.png


For reference:




On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <[hidden email]> wrote:
Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there is a list of types that allowed to be included in the summary, and I love to extend that list.

The first one is the date type. It is important to define this together with the community, and that we get consensus, as this will be part of the public API. Changing this will be costly (or maybe impossible) to do.

I've checked what other DBMS'es do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)
However, converting to an int, isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
Bigquery

Unsupported:

image.png

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as it is being used as a physical format internally in many data formats.

image.png

For me, returning a Date as the output of avg(date) seems like a logical choice. Internally it is handled as dates since epoch, which makes sense as well:


Any further thoughts on this before moving forward?

Kind regards, Fokko


Reply | Threaded
Open this post in threaded view
|

Re: Allow average out of a Date

cloud0fan
> can't we just teach that function how to do this using existing functionality w/o exposing all this internal numeric representation for dates?

+1, we don't need to complicate the type system and claim that date/timestamp can do avg/sum/etc. , we can just turn date/timestamp to numeric, do whatever you want, and turn the result back to date/timestamp, in the `summary` function.

On Mon, Aug 24, 2020 at 3:37 AM Greg Rahn <[hidden email]> wrote:
I think your "working" timestamp example exposes the crux of the implicit casting problem -- and PR#28554 points to one of the other prominent issues -- function return types.  The current implementation seems to be problematic to me because there should not be implicit conversions of timestamp types to double because there is no single definition of how to represent a timestamp in double format.  It's fine to have explicit casts or helper functions -- because then the user is consciously choosing the "units" being used for a given numerical representation.

If the goal here is for the summary to work on dates, can't we just teach that function how to do this using existing functionality w/o exposing all this internal numeric representation for dates? 

Just to be clear -- I have no objection for the objective here -- I just have concerns about the proposed implementation and the casting impacts.

On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <[hidden email]> wrote:
Thanks, Greg for your elaborate response, and sharing the table.

The open PR is targetting SPARK-10520: https://github.com/apache/spark/pull/28554 Before opening a new ticket, I noticed that this was already an open issue. However, this will make it useable for Python/Java/Scala as well.

Not being able to sum dates is still an open issue. For the current solution I've taken inspiration from the timestamp implementation in Spark:

CREATE VIEW timestamps
    AS SELECT explode(array(
                  CAST('2020-07-01 12:00:00' AS timestamp),
                  CAST('2020-07-02 18:00:00' AS timestamp)
              )) AS a;

SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;

Sum:         3.1873032E9
Sum / Count: 1.5936516E9
Avg:         1.5936516E9

Using Spark 2.4.5. The datatype is implicitly converted a DoubleType (seconds since epoch), while I believe this should be an explicit choice. I've started the discussion in a separate PR.

I fully agree that summing/subtracting dates/timestamps don't look very sensible. However, taking an average can be of use. For example, if you're sessionising user activity or log entries, you might want to have a single date/timestamp to represent the session. Taking an average (or a median?) would make sense in such a case. 

> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

I would not bother the end-user with all these questions, as I would expect in most cases this isn't an issue. If the user wants to have full control, it can first cast the date to a double, which gives days since epoch. Then it can have full control over possible rounding etc. Next step would be taking the std-dev out of a Date column, I would suggest using an interval type to represent the deviation, but let's take one step at the time.

> It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

For now, I've decided to stick with the days since Unix epoch, as it seems to be popular among file formats:


> If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.

I see your point, and I don't like to reinvent the standards. But the ANSI standard is clearly too limited, as you might also want to do things that you normally wouldn't do in a database (for example, when it comes to feature generation or the abovementioned sessionising). While looking at other RDBM'ses, it is clear that there isn't a uniform way of doing this, as shown in my initial mail. For example, you won't be able to sum the cast output (YYYYMMDD) format of MySQL, because you might get invalid months/days in the output (oktober+oktober=month 20?).

Currently, it isn't possible to sum dates, if this is something that you want to have in there, I can add it to the PR. For the .summary() function it isn't required. The workaround is to cast it to a double first and then sum the days since epoch.

Kind regards, Fokko


Op vr 21 aug. 2020 om 18:13 schreef Greg Rahn <[hidden email]>:
Fokko-

I can certainly understand the ask for this behavior and it seems simple and/or straightforward at face value -- as R's Date Class behaves this way, however, I think there are several challenges involved here, as AFAIK, there seems to be no well defined way SQL systems do this implicitly today -- and I think there is a clear explanation of why.

As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

If we look back at the use case from SPARK-10520 which is about R's summary function on a vector of dates -- (and I don't claim to be an R internals expert) but maybe it makes more sense to implement R's Date Class behavior in SparkR vs more generically across multiple languages and types.  If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.


Screen Shot 2020-08-21 at 12.00.01 PM.png


For reference:




On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <[hidden email]> wrote:
Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there is a list of types that allowed to be included in the summary, and I love to extend that list.

The first one is the date type. It is important to define this together with the community, and that we get consensus, as this will be part of the public API. Changing this will be costly (or maybe impossible) to do.

I've checked what other DBMS'es do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)
However, converting to an int, isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
Bigquery

Unsupported:

image.png

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as it is being used as a physical format internally in many data formats.

image.png

For me, returning a Date as the output of avg(date) seems like a logical choice. Internally it is handled as dates since epoch, which makes sense as well:


Any further thoughts on this before moving forward?

Kind regards, Fokko


Reply | Threaded
Open this post in threaded view
|

Re: Allow average out of a Date

Driesprong, Fokko
> +1, we don't need to complicate the type system and claim that date/timestamp can do avg/sum/etc. , we can just turn date/timestamp to numeric, do whatever you want, and turn the result back to date/timestamp, in the `summary` function.

+1 This is exactly my proposal in: [SPARK-10520][SQL] Allow average out of DateType

This will return a date out of avg(date), so the user isn't confused because it gets to see an arbitrary number (in this case, days since epoch).

As a sidestep, I've illustrated the behaviour of avg(datetime), which returns the seconds since epoch, but should als return a datetime in my opinion: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp However, changing this means changing the public api :'( Feel free to chim in on the PR if you're interested.

My proposal is to keep the input and output type the same.

Op ma 24 aug. 2020 om 05:04 schreef Wenchen Fan <[hidden email]>:
> can't we just teach that function how to do this using existing functionality w/o exposing all this internal numeric representation for dates?

+1, we don't need to complicate the type system and claim that date/timestamp can do avg/sum/etc. , we can just turn date/timestamp to numeric, do whatever you want, and turn the result back to date/timestamp, in the `summary` function.

On Mon, Aug 24, 2020 at 3:37 AM Greg Rahn <[hidden email]> wrote:
I think your "working" timestamp example exposes the crux of the implicit casting problem -- and PR#28554 points to one of the other prominent issues -- function return types.  The current implementation seems to be problematic to me because there should not be implicit conversions of timestamp types to double because there is no single definition of how to represent a timestamp in double format.  It's fine to have explicit casts or helper functions -- because then the user is consciously choosing the "units" being used for a given numerical representation.

If the goal here is for the summary to work on dates, can't we just teach that function how to do this using existing functionality w/o exposing all this internal numeric representation for dates? 

Just to be clear -- I have no objection for the objective here -- I just have concerns about the proposed implementation and the casting impacts.

On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <[hidden email]> wrote:
Thanks, Greg for your elaborate response, and sharing the table.

The open PR is targetting SPARK-10520: https://github.com/apache/spark/pull/28554 Before opening a new ticket, I noticed that this was already an open issue. However, this will make it useable for Python/Java/Scala as well.

Not being able to sum dates is still an open issue. For the current solution I've taken inspiration from the timestamp implementation in Spark:

CREATE VIEW timestamps
    AS SELECT explode(array(
                  CAST('2020-07-01 12:00:00' AS timestamp),
                  CAST('2020-07-02 18:00:00' AS timestamp)
              )) AS a;

SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;

Sum:         3.1873032E9
Sum / Count: 1.5936516E9
Avg:         1.5936516E9

Using Spark 2.4.5. The datatype is implicitly converted a DoubleType (seconds since epoch), while I believe this should be an explicit choice. I've started the discussion in a separate PR.

I fully agree that summing/subtracting dates/timestamps don't look very sensible. However, taking an average can be of use. For example, if you're sessionising user activity or log entries, you might want to have a single date/timestamp to represent the session. Taking an average (or a median?) would make sense in such a case. 

> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

I would not bother the end-user with all these questions, as I would expect in most cases this isn't an issue. If the user wants to have full control, it can first cast the date to a double, which gives days since epoch. Then it can have full control over possible rounding etc. Next step would be taking the std-dev out of a Date column, I would suggest using an interval type to represent the deviation, but let's take one step at the time.

> It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

For now, I've decided to stick with the days since Unix epoch, as it seems to be popular among file formats:


> If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.

I see your point, and I don't like to reinvent the standards. But the ANSI standard is clearly too limited, as you might also want to do things that you normally wouldn't do in a database (for example, when it comes to feature generation or the abovementioned sessionising). While looking at other RDBM'ses, it is clear that there isn't a uniform way of doing this, as shown in my initial mail. For example, you won't be able to sum the cast output (YYYYMMDD) format of MySQL, because you might get invalid months/days in the output (oktober+oktober=month 20?).

Currently, it isn't possible to sum dates, if this is something that you want to have in there, I can add it to the PR. For the .summary() function it isn't required. The workaround is to cast it to a double first and then sum the days since epoch.

Kind regards, Fokko


Op vr 21 aug. 2020 om 18:13 schreef Greg Rahn <[hidden email]>:
Fokko-

I can certainly understand the ask for this behavior and it seems simple and/or straightforward at face value -- as R's Date Class behaves this way, however, I think there are several challenges involved here, as AFAIK, there seems to be no well defined way SQL systems do this implicitly today -- and I think there is a clear explanation of why.

As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG() works on DATE, then SUM() (and the plus operator) would need to work as well -- but what is the definition of SUM() for a list of DATEs or DATE + DATE? There isn't any such definition in the SQL world. And take a very simple example -- what's the AVG() of today and yesterday? If you represent these dates as their Julian integer value, the average of two consecutive integers is a floating point number between them -- so now what?  Should it be ceil() or floor()? Round down or round up? Should it be yesterday or today?  It seems very challenging to deal with this in a formal and standardized way.

It's my understanding that it's this ambiguity, and the need for an intermediate numerical representation like the Julian Day Number, that limits this from being an implicit SQL data type conversion operation and why users are forced to choose their own explicit conversion (to Julian date, for example) in order to do such mathematical operations. As we see in the SQL:2016 standard cast specification and conversion table (below) -- there is no standardized way to implicitly cast between DATE and numeric types because there is not just one numerical representation of a DATE.

If we look back at the use case from SPARK-10520 which is about R's summary function on a vector of dates -- (and I don't claim to be an R internals expert) but maybe it makes more sense to implement R's Date Class behavior in SparkR vs more generically across multiple languages and types.  If possible, this would seem to yield the desired behavior for the use case in question, but not broaden the requirement so wide that it becomes too challenging to implement and conflicts with other well defined standards, like ANSI SQL:2016.


Screen Shot 2020-08-21 at 12.00.01 PM.png


For reference:




On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <[hidden email]> wrote:
Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there is a list of types that allowed to be included in the summary, and I love to extend that list.

The first one is the date type. It is important to define this together with the community, and that we get consensus, as this will be part of the public API. Changing this will be costly (or maybe impossible) to do.

I've checked what other DBMS'es do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)
However, converting to an int, isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
Bigquery

Unsupported:

image.png

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as it is being used as a physical format internally in many data formats.

image.png

For me, returning a Date as the output of avg(date) seems like a logical choice. Internally it is handled as dates since epoch, which makes sense as well:


Any further thoughts on this before moving forward?

Kind regards, Fokko