[Spark SQL] Question about support for TimeType columns in Apache Parquet files

8 messages

Rylan Dmello

Hello,


Tahsin and I are trying to use the Apache Parquet file format with Spark SQL, but are running into errors when reading Parquet files that contain TimeType columns. We're wondering whether this is unsupported in Spark SQL due to an architectural limitation, or due to lack of resources?


Context: When reading some Parquet files with Spark, we get an error message like the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 186.0 failed 4 times, most recent failure: Lost task 0.3 in stage 186.0 (TID 1970, 10.155.249.249, executor 1): java.io.IOException: Could not read or convert schema for file: dbfs:/test/randomdata/sample001.parquet
...
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIME_MICROS);
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:106)


This only seems to occur with Parquet files that have a column of the "TimeType" (or the deprecated "TIME_MILLIS"/"TIME_MICROS") type. After digging into this a bit, we think that the error message is coming from "ParquetSchemaConverter.scala" here: link.
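For context, Parquet's TIME_MICROS annotation marks an INT64 column as a count of microseconds since midnight. A minimal stdlib-Python sketch of the conversion a reader would have to perform (the function name is ours, not Spark's):

```python
from datetime import time

def time_micros_to_time(micros: int) -> time:
    """Convert a Parquet TIME_MICROS value (microseconds since midnight) to datetime.time."""
    if not 0 <= micros < 24 * 60 * 60 * 1_000_000:
        raise ValueError(f"out of range for a time of day: {micros}")
    seconds, micro = divmod(micros, 1_000_000)
    minutes, sec = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, sec, micro)

print(time_micros_to_time(45_296_000_001))  # 12:34:56.000001
```

This is only a sketch of the semantics; the actual failure above happens earlier, at schema-conversion time, before any values are decoded.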


This seems to imply that the Spark SQL engine does not support reading Parquet files with TimeType columns.

We are wondering if anyone on the mailing list could shed some more light on this: are there architectural/datatype limitations in Spark that are causing this error, or is TimeType support for Parquet files something that hasn't been implemented yet due to lack of resources/interest?


Thanks,

Rylan

Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Bart Samwel

(FWIW, a recent PR tried adding this again: https://github.com/apache/spark/pull/28858.)

On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello <[hidden email]> wrote:



--
Bart Samwel



Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Rylan Dmello
Hello Bart,

Thank you for sharing these links; this was exactly what Tahsin and I were looking for. It looks like there has been a lot of discussion about this already, which is good to see.

In one of these pull requests, there is a comment about the number of real-world use-cases for some kind of TimeType in Spark. We could offer our need for compatibility with Parquet's TimeType as one use-case for a new Spark TimeType.

Would it be helpful to collect/document these TimeType use-cases to gauge interest? We could add a new story or comment in the Spark JIRA or a page on the Apache Confluence if that helps.

Rylan

From: Bart Samwel <[hidden email]>
Sent: Wednesday, June 24, 2020 4:08 PM
To: Rylan Dmello <[hidden email]>
Cc: [hidden email] <[hidden email]>; Tahsin Hassan <[hidden email]>
Subject: Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files
 




Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Bart Samwel
I can't comment on that myself; I haven't been part of the community, so I don't know what is customary for this kind of thing. W.r.t. "compatibility with Parquet's TimeType", I'd argue that that isn't a use case by itself. The use case is what people do with it. All in all, TIME is just clock hands, and the number of uses for that is kind of limited. Typical things that people try to do involve physical timestamps (a physical point in time, like Spark's timestamp type), logical timestamps (date + hands of the clock, not associated with any time zone), or dates (logical, not associated with any time zone). The one reason I can see to have TIME is to make the type system orthogonal, i.e., to have a DATE type, a TIME type, and a DATETIME type that is a DATE plus a TIME. But is it useful by itself? Not that much. Maybe it's useful if you're building a scheduler, like cron? If you have more actual use cases for this that aren't easily satisfied in another way, then by all means share them!

On Thu, Jun 25, 2020 at 10:41 PM Rylan Dmello <[hidden email]> wrote:




--
Bart Samwel



Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Maxim Gekk
Hi Bart,

> But is it useful by itself? Not that much.

I see at least the following use cases. Let's say we need to analyze some events from devices installed in different places or time zones, like:

At Europe/Amsterdam:
(2020-06-25, 08:10, event1)
(2020-06-26, 12:10, event1)
(2020-06-26, 18:30, event2) 

At Europe/Moscow:
(2020-06-25, 09:11, event2)
(2020-06-26, 13:09, event1)
(2020-06-26, 18:30, event3) 

While building aggregates, we don't care about dates and time zones. For example, we may want to build histograms for every one-hour interval, or find the hours of the day when the maximum number of events happened. In that case, a TIME column would be useful. For sure, you can emulate the TIME column by extracting time fields from TIMESTAMP, for example by:
- creating 2-3 columns (HOUR(timestamp), MINUTE(timestamp), SECOND(timestamp)), or
- building a column with offsets from the beginning of the day, like 60*HOUR(timestamp) + MINUTE(timestamp),
but it is not convenient.
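A rough pure-Python sketch of the emulation described above (the names and sample data are ours), bucketing events by hour of day regardless of date:

```python
from collections import Counter
from datetime import datetime

def seconds_since_midnight(ts: datetime) -> int:
    # Emulates a TIME column as an integer offset from the start of the day.
    return ts.hour * 3600 + ts.minute * 60 + ts.second

events = [
    datetime(2020, 6, 25, 8, 10),
    datetime(2020, 6, 26, 12, 10),
    datetime(2020, 6, 26, 18, 30),
]

# Histogram per one-hour interval, ignoring the date part entirely.
histogram = Counter(seconds_since_midnight(ts) // 3600 for ts in events)
print(histogram)  # hour-of-day -> event count
```

This works, but as noted it pushes time-of-day semantics into bare integers, which is exactly the inconvenience being discussed.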

Another use case is to have a small table that describes a process, procedure, or schedule (as you mentioned above):
(08:00, wake up)
(09:00, have breakfast)
(09:30, go to the office)
(18:00, come home)

and a fact table
(2020-07-01, 10:00, Morning TV show)
(2020-08-02, 19:00, Soccer game)

So, you can join your day schedule with the fact table and find out which TV shows you can watch when you are at home.
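A toy sketch of that join in plain Python (the data and the "at home after 18:00" rule are our assumptions, just to make the idea concrete):

```python
from datetime import date, time

# Hypothetical day schedule and fact table, keyed by time of day.
schedule = [
    (time(8, 0), "wake up"),
    (time(9, 0), "have breakfast"),
    (time(9, 30), "go to the office"),
    (time(18, 0), "come home"),
]
facts = [
    (date(2020, 7, 1), time(10, 0), "Morning TV show"),
    (date(2020, 8, 2), time(19, 0), "Soccer game"),
]

# "Join" the fact table against the schedule: keep shows airing after coming home.
at_home_from = time(18, 0)
watchable = [show for (_, t, show) in facts if t >= at_home_from]
print(watchable)  # ['Soccer game']
```

The point is that the comparison and the join key are times of day with no date attached, which is what a dedicated TIME type would express directly.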

Maxim Gekk

Software Engineer

Databricks, Inc.



On Fri, Jun 26, 2020 at 12:40 PM Bart Samwel <[hidden email]> wrote:



Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Bart Samwel
On Fri, Jun 26, 2020 at 12:24 PM Maxim Gekk <[hidden email]> wrote:
Hi Bart,

> But is it useful by itself? Not that much.

I see at least the following use cases. Let's say we need to analyze some events from devices installed in different places or time zones like

At Europe/Amsterdam:
(2020-06-25, 08:10, event1)
(2020-06-26, 12:10, event1)
(2020-06-26, 18:30, event2) 

At Europe/Moscow:
(2020-06-25, 09:11, event2)
(2020-06-26, 13:09, event1)
(2020-06-26, 18:30, event3) 

Isn't it a best practice to have those event timestamps be recorded in UTC? Unless you really don't care about the physical time when something happened, but that seems rare.
 
While building aggregates, we don't care about dates and time zones. For example, we may want to build histograms for every one-hour interval, or find the hours of the day when the maximum number of events happened. In that case, a TIME column would be useful. For sure, you can emulate the TIME column by extracting time fields from TIMESTAMP, for example by:
- creating 2-3 columns (HOUR(timestamp), MINUTE(timestamp), SECOND(timestamp)), or
- building a column with offsets from the beginning of the day, like 60*HOUR(timestamp) + MINUTE(timestamp),
but it is not convenient.

Right, it's not convenient, but it works. If we were to support TIME, then you could EXTRACT(TIME FROM timestamp), maybe with some timezone annotation to say that you want the time *w.r.t. a particular timezone*. I don't know the SQL standard way of doing that -- I only know BigQuery / ZetaSQL's EXTRACT (... AT TIME ZONE ...) extension which works pretty well for that. But then you'd still have to truncate it to the granularity that you care about (5 minutes? 15 minutes? 30 minutes? 60 minutes?). Doing this with hours + minutes separately gives you a lot more flexibility, unless we also add all the right truncation functionality. (I'm not saying that we shouldn't, I'm just extrapolating what we'd need to have to support this use case.)
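The truncation step mentioned here could look roughly like this in plain Python (a hypothetical helper, not a Spark or BigQuery API):

```python
from datetime import time

def truncate_time(t: time, minutes: int) -> time:
    """Truncate a time of day down to the start of its N-minute bucket."""
    total = t.hour * 60 + t.minute
    bucket_start = (total // minutes) * minutes
    return time(bucket_start // 60, bucket_start % 60)

print(truncate_time(time(18, 37), 15))  # 18:30:00
print(truncate_time(time(18, 37), 60))  # 18:00:00
```

As the paragraph notes, supporting a TIME type well would mean providing truncation like this for every granularity users care about, not just the extraction itself.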
 




--
Bart Samwel



Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Maxim Gekk
Hi Bart,

> Isn't it a best practice to have those event timestamps be recorded in UTC?

The triple (date, time, time zone) from my examples can be mapped to a timestamp in UTC without ambiguity (almost). Having separate types for dates and times allows users to logically de-normalize their data and simplifies future work on that data.

> Right, it's not convenient, but it works. If we were to support TIME, then you could EXTRACT(TIME FROM timestamp) ...

According to this logic, we don't need the DATE type because it can be extracted from TIMESTAMPs. Even more, the TIMESTAMP type is not needed either, as it can be "extracted" from strings (or longs). Let's keep everything in strings ;-) this will work.

By not providing appropriate types, we encourage users to choose more generic types like INT for time offsets within a day. From my point of view, this closes the door on optimizations in Catalyst and on valid-range checking, and makes users' apps error-prone.
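The range-checking point can be illustrated with a hedged plain-Python sketch (names are ours): a dedicated time-of-day type can reject values that a bare INT offset column would silently accept.

```python
from datetime import time

def as_time_of_day(seconds: int) -> time:
    """Validate an integer second-of-day offset and wrap it in a proper time value."""
    # A bare INT column would happily store 90_000; a real time type cannot.
    if not 0 <= seconds < 86_400:
        raise ValueError(f"not a valid time-of-day offset: {seconds}")
    return time(seconds // 3600, seconds % 3600 // 60, seconds % 60)

print(as_time_of_day(3_600))  # 01:00:00
# as_time_of_day(90_000) would raise ValueError
```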

Maxim Gekk

Software Engineer

Databricks, Inc.



On Fri, Jun 26, 2020 at 3:48 PM Bart Samwel <[hidden email]> wrote:



Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Bart Samwel
On Fri, Jun 26, 2020 at 3:38 PM Maxim Gekk <[hidden email]> wrote:
Hi Bart,

> Isn't it a best practice to have those event timestamps be recorded in UTC?

The triple (date, time, time zone) from my examples can be mapped to a timestamp in UTC without ambiguity (almost). Having separate types for dates and times allows users to logically de-normalize their data and simplifies future work on that data.

> Right, it's not convenient, but it works. If we were to support TIME, then you could EXTRACT(TIME FROM timestamp) ...

According to this logic, we don't need the DATE type because it can be extracted from TIMESTAMPs. Even more, the TIMESTAMP type is not needed either, as it can be "extracted" from strings (or longs). Let's keep everything in strings ;-) this will work.

By not providing appropriate types, we encourage users to choose more generic types like INT for time offsets within a day. From my point of view, this closes the door on optimizations in Catalyst and on valid-range checking, and makes users' apps error-prone.

It all depends on how common these types are. If I had to choose where to spend our complexity budget, there are other types that would come first on my list based on how common the use case is. I'd go for DATETIME (logical timestamps) before I'd go for TIME by itself.
 

Maxim Gekk

Software Engineer

Databricks, Inc.



On Fri, Jun 26, 2020 at 3:48 PM Bart Samwel <[hidden email]> wrote:
On Fri, Jun 26, 2020 at 12:24 PM Maxim Gekk <[hidden email]> wrote:
Hi Bart,

> But is it useful by itself? Not that much.

I see at least the following use cases. Let's say we need to analyze some events from devices installed in different places or time zones like

At Europe/Amsterdam:
(2020-06-25, 08:10, event1)
(2020-06-26, 12:10, event1)
(2020-06-26, 18:30, event2) 

At Europe/Moscow:
(2020-06-25, 09:11, event2)
(2020-06-26, 13:09, event1)
(2020-06-26, 18:30, event3) 

Isn't it a best practice to have those event timestamps be recorded in UTC? Unless you really don't care about the physical time when something happened, but that seems rare.
 
While building aggregates, we don't care about dates and time zones. For example, we may want to build a histogram per hour interval, or find the hours of the day when the maximum number of events happened. In that case, a TIME column would be useful. For sure, you can emulate the TIME column by extracting time fields from TIMESTAMP, e.g. by
- creating 2-3 columns (HOUR(timestamp), MINUTE(timestamp), SECOND(timestamp)), or
- building a column with offsets from the beginning of the day, like 60*HOUR(timestamp) + MINUTE(timestamp),
but it is not convenient.
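The second emulation can be sketched in a few lines of plain Python (the data and variable names are made up to mirror the event examples above; in Spark this would be column expressions rather than a list comprehension):

```python
from collections import Counter
from datetime import datetime

# Events as full timestamps; for an hour-of-day histogram only the
# clock-hands part matters, not the date.
events = [
    datetime(2020, 6, 25, 8, 10),
    datetime(2020, 6, 26, 12, 10),
    datetime(2020, 6, 26, 18, 30),
    datetime(2020, 6, 26, 18, 45),
]

# The emulated TIME column: a minutes-from-midnight offset,
# i.e. 60*HOUR(timestamp) + MINUTE(timestamp).
offsets = [60 * ts.hour + ts.minute for ts in events]

# Histogram per hour bucket: which hour of the day is busiest?
per_hour = Counter(m // 60 for m in offsets)
print(per_hour.most_common(1))  # [(18, 2)]
```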

Right, it's not convenient, but it works. If we were to support TIME, then you could EXTRACT(TIME FROM timestamp), maybe with some timezone annotation to say that you want the time *w.r.t. a particular timezone*. I don't know the SQL standard way of doing that -- I only know BigQuery / ZetaSQL's EXTRACT (... AT TIME ZONE ...) extension which works pretty well for that. But then you'd still have to truncate it to the granularity that you care about (5 minutes? 15 minutes? 30 minutes? 60 minutes?). Doing this with hours + minutes separately gives you a lot more flexibility, unless we also add all the right truncation functionality. (I'm not saying that we shouldn't, I'm just extrapolating what we'd need to have to support this use case.)
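The truncation step mentioned above is simple to express once you have a minutes-from-midnight representation. A plain-Python sketch (the helper name `truncate_time` is hypothetical, not a Spark or standard-library function):

```python
from datetime import time

def truncate_time(t: time, minutes: int) -> time:
    """Round a clock time down to the start of its `minutes`-wide bucket."""
    total = t.hour * 60 + t.minute
    start = total - total % minutes
    return time(start // 60, start % 60)

# The same time under different granularities:
print(truncate_time(time(18, 47), 15))  # 18:45:00
print(truncate_time(time(18, 47), 30))  # 18:30:00
print(truncate_time(time(18, 47), 60))  # 18:00:00
```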
 
Another use case is to have a small table that describes process or procedure or schedule (as you mention above). 
(08:00, wake up)
(09:00, have breakfast)
(09:30, go to the office)
(18:00, come home)

and a fact table
(2020-07-01, 10:00, Morning TV show)
(2020-08-02, 19:00, Soccer game)

So, you can join your day schedule with the fact table and find out which TV shows you can watch when you are at home.
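That schedule/fact join can be sketched in plain Python (a toy stand-in for the SQL join; the interval bounds are assumptions derived from the schedule rows above, e.g. treating 08:00-09:30 and 18:00 onward as "at home"):

```python
from datetime import time

# Daily schedule reduced to "at home" intervals: pure clock times, no dates.
at_home = [
    (time(8, 0), time(9, 30)),    # from waking up until leaving for the office
    (time(18, 0), time(23, 59)),  # from coming home until (assumed) bedtime
]

# Fact table: (date, time, show)
shows = [
    ("2020-07-01", time(10, 0), "Morning TV show"),
    ("2020-08-02", time(19, 0), "Soccer game"),
]

# "Join" the schedule with the facts: keep shows airing while at home.
watchable = [name for _, t, name in shows
             if any(lo <= t < hi for lo, hi in at_home)]
print(watchable)  # ['Soccer game']
```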

Maxim Gekk

Software Engineer

Databricks, Inc.



On Fri, Jun 26, 2020 at 12:40 PM Bart Samwel <[hidden email]> wrote:
I can't comment on that myself; I haven't been part of the community, so I don't know what is customary for this kind of thing. W.r.t. "compatibility with Parquet's TimeType", I'd argue that that isn't a use case by itself. The use case is what people do with it. All in all, TIME is just the hands of a clock, and the number of uses for that is rather limited. Typical things that people try to do involve physical timestamps (a physical point in time, like Spark's timestamp type), logical timestamps (date + hands of the clock, not associated with any time zone), or dates (logical, not associated with any time zone). The one reason I can see to have TIME is to make the type system orthogonal, i.e., to have a DATE type, a TIME type, and a DATETIME type that is a DATE plus a TIME. But is it useful by itself? Not that much. Maybe it's useful if you're building a scheduler, like cron? If you have more actual use cases for this that aren't easily satisfied in another way, then by all means share them!

On Thu, Jun 25, 2020 at 10:41 PM Rylan Dmello <[hidden email]> wrote:
Hello Bart,

Thank you for sharing these links, this was exactly what Tahsin and I were looking for. It looks like there has been a lot of discussion about this already, which is good to see.

In one of these pull requests, there is a comment about the number of real-world use-cases for some kind of TimeType in Spark. We could add our use-case of compatibility with Parquet's TimeType as a use-case for a new Spark TimeType.

Would it be helpful to collect/document these TimeType use-cases to gauge interest? We could add a new story or comment in the Spark JIRA or a page on the Apache Confluence if that helps.

Rylan

From: Bart Samwel <[hidden email]>
Sent: Wednesday, June 24, 2020 4:08 PM
To: Rylan Dmello <[hidden email]>
Cc: [hidden email] <[hidden email]>; Tahsin Hassan <[hidden email]>
Subject: Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files
 

(FWIW, a recent PR tried adding this again: https://github.com/apache/spark/pull/28858.)

On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello <[hidden email]> wrote:

Hello,


Tahsin and I are trying to use the Apache Parquet file format with Spark SQL, but are running into errors when reading Parquet files that contain TimeType columns. We're wondering whether this is unsupported in Spark SQL due to an architectural limitation, or due to lack of resources?


Context: When reading some Parquet files with Spark, we get an error message like the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 186.0 failed 4 times, most recent failure: Lost task 0.3 in stage 186.0 (TID 1970, 10.155.249.249, executor 1): java.io.IOException: Could not read or convert schema for file: dbfs:/test/randomdata/sample001.parquet
...
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIME_MICROS);
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:106)


This only seems to occur with Parquet files that have a column with the "TimeType" (or the deprecated "TIME_MILLIS"/"TIME_MICROS") types in the Parquet file. After digging into this a bit, we think that the error message is coming from "ParquetSchemaConverter.scala" here: link.


This seems to imply that the Spark SQL engine does not support reading Parquet files with TimeType columns.

We are wondering if anyone on the mailing list could shed some more light on this: are there architectural/datatype limitations in Spark that result in this error, or is TimeType support for Parquet files something that hasn't been implemented yet due to lack of resources/interest?


Thanks,

Rylan


--
Bart Samwel



