proposal for expanded & consistent timestamp types

proposal for expanded & consistent timestamp types

Imran Rashid-4
Hi,

I'd like to discuss the future of timestamp support in Spark, in particular with respect to handling time zones in the different SQL types. In a nutshell:

* There are at least 3 different ways of handling the timestamp type across time zone changes (see the sketch after this list).
* We'd like Spark to clearly distinguish the 3 types (it currently implements 1 of them), in a way that is backwards compatible, and also compliant with the SQL standard.
* We'll get agreement across Spark, Hive, and Impala.
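
To make the distinction concrete, here is a rough java.time sketch of the three behaviours, using the names the doc uses (the zone and values below are made up purely for illustration; this is not code from the proposal):

import java.time.{Instant, LocalDateTime, OffsetDateTime, ZoneId, ZoneOffset}

// Instant semantics: a fixed point on the timeline; only its display
// changes with the reader's time zone.
val instant = Instant.parse("2018-12-06T16:03:00Z")
println(instant.atZone(ZoneId.of("UTC")))               // 2018-12-06T16:03Z[UTC]
println(instant.atZone(ZoneId.of("America/New_York")))  // 2018-12-06T11:03-05:00[America/New_York]

// LocalDateTime semantics: a wall-clock reading with no time zone at all;
// it never shifts, no matter who reads it or where.
val local = LocalDateTime.parse("2018-12-06T11:03:00")

// OffsetDateTime semantics: a wall-clock reading plus an explicit offset,
// preserving both the civil time and the exact point on the timeline.
val offset = OffsetDateTime.of(local, ZoneOffset.ofHours(-5))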

Zoltan Ivanfi (Parquet PMC, also my coworker) has written up a detailed doc describing the problem in more detail, the state of various SQL engines, and how we can get to a better state without breaking any current use cases.  The proposal is good for Spark by itself.  We're also going to the Hive and Impala communities with this proposal, as it's better for everyone if everything is compatible.

Note that this isn't proposing a specific implementation in Spark as yet, just a description of the overall problem and our end goal.  We're going to each community to get agreement on the overall direction.  Then each community can figure out specifics as they see fit.  (I don't think there are any technical hurdles with this approach, e.g. to decide whether this would even be possible in Spark.)

Here's a link to the doc Zoltan has put together.  It is a bit long, but it explains how such a seemingly simple concept has become such a mess and how we can get to a better state.

https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.dq3b1mwkrfky
Please review the proposal and let us know your opinions, concerns and suggestions.

thanks,
Imran

Re: proposal for expanded & consistent timestamp types

Li Jin
Imran,

Thanks for sharing this. When working on interop between Spark and Pandas/Arrow in the past, we also faced some issues due to the different definitions of timestamps in Spark and Pandas/Arrow: Spark timestamps have Instant semantics, while Pandas/Arrow timestamps have either LocalDateTime or OffsetDateTime semantics. (Detailed discussion is in the PR: https://github.com/apache/spark/pull/18664#issuecomment-316554156.)

For one, I am excited to see this effort going forward, but I would also love to see Python interop included/considered in the picture. I don't think it adds much to what has already been proposed, because Python timestamps are basically LocalDateTime or OffsetDateTime.
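
As a rough illustration of why the mismatch bites (a made-up Scala sketch, not the actual code from that PR): Spark renders the same stored value differently depending on the session time zone, because its timestamps are Instants, while a timezone-naive Pandas/Arrow timestamp (LocalDateTime semantics) would not shift at all.

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("ts-semantics").getOrCreate()
import spark.implicits._

// One stored value...
val df = Seq(Timestamp.valueOf("2018-12-06 11:03:00")).toDF("ts")

// ...rendered differently under two session time zones (Instant semantics).
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.selectExpr("date_format(ts, 'yyyy-MM-dd HH:mm:ss') AS rendered").show()

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
df.selectExpr("date_format(ts, 'yyyy-MM-dd HH:mm:ss') AS rendered").show()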

Li




Re: proposal for expanded & consistent timestamp types

Maxim Gekk
Hello Imran,

Thank you for bringing this problem up. I have faced the issue of handling timestamps and dates when I implemented date/timestamp parsing in the CSV/JSON datasources:

Maxim Gekk

Technical Solutions Lead

Databricks B. V. 




Re: proposal for expanded & consistent timestamp types

Imran Rashid-2
In reply to this post by Li Jin
Hi Li,

Thanks for the comments!  I admit I had not thought very much about Python support; it's a good point.  But I'd actually like to clarify one thing about the doc -- though it discusses Java types, the point is actually about having support for these logical types at the SQL level.  The doc uses Java names instead of SQL names just because there is so much confusion around the SQL names, as they haven't been implemented consistently.  Once there is support for the additional logical types, then we'd absolutely want to get the same support in Python.

It's great to hear there are existing Python types we can map each behavior to.  Could you add a comment on the doc for each of the types, mentioning the equivalent in Python?

thanks,
Imran


Re: proposal for expanded & consistent timestamp types

Li Jin
Of course. I added some comments in the doc. 


Re: proposal for expanded & consistent timestamp types

cloud0fan
I like this proposal.

> We'll get agreement across Spark, Hive, and Impala.

Shall we include Parquet and ORC? If they don't support it, it's hard for general query engines like Spark to support it.


Re: proposal for expanded & consistent timestamp types

Zoltan Ivanfi-2
Hi,

On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan <[hidden email]> wrote:

> Shall we include Parquet and ORC? If they don't support it, it's hard for general query engines like Spark to support it.

For each of the more explicit timestamp types we propose a single semantics regardless of the file format. Query engines and other applications must explicitly support the new semantics, but it is not strictly necessary to extend or modify the file formats themselves, since users can declare the desired semantics directly in the end-user applications:

- In SQL they would do so by using the more explicit timestamp types as detailed in the proposal. And since the SQL engines in question share the same metastore, users only have to define/update the SQL schema once to achieve interoperability in SQL.

- Other applications will have to add support for the different semantics, but due to the large number of such applications, we cannot coordinate all of that effort. Hopefully though, if we add support in the three major Hadoop SQL engines, other applications will follow suit.

- Spark, specifically, falls into both of the categories mentioned above. It supports SQL queries, where it gets the benefit of the SQL schemas shared via the metastore. It also supports reading data files directly, where the correct timestamp semantics to use would have to be declared programmatically by the user/consumer of the API.

That being said, although not strictly necessary, it is beneficial to store the semantics in some file-level metadata as well. This allows writers to record the intended semantics of timestamps and readers to recognize it, so no input is needed from the user when data is ingested from or exported to other tools. It will still require explicit support from the applications though. Parquet does have such metadata about the timestamp semantics: the isAdjustedToUTC field is part of the new parametric timestamp logical type. True means Instant semantics, while false means LocalDateTime semantics.
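
For example, with the parquet-mr schema API (assuming a release that already exposes the new logical type annotations, roughly 1.11+; the column names are made up), the two semantics could be declared like this:

import org.apache.parquet.schema.{LogicalTypeAnnotation, Types}
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

// isAdjustedToUTC = true  -> Instant semantics
val instantColumn = Types.required(PrimitiveTypeName.INT64)
  .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.MICROS))
  .named("event_time_utc")

// isAdjustedToUTC = false -> LocalDateTime semantics
val localColumn = Types.required(PrimitiveTypeName.INT64)
  .as(LogicalTypeAnnotation.timestampType(false, TimeUnit.MICROS))
  .named("event_time_local")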

I support the idea of adding similar metadata to other file formats as well, but I consider that to be a second step. First I would like to reach an agreement on how the different SQL timestamp types should behave. (Until we follow this up with that second step, file formats with a single non-parametric timestamp type can store arbitrary semantics too; users just have to be aware of what timestamp semantics were used when they create a SQL table over the data or read it in non-SQL applications. Alternatively, we may limit the new types to file formats with timestamp semantics metadata and postpone support for other file formats until semantics metadata is added to them.)

Br,

Zoltan


Re: proposal for expanded & consistent timestamp types

Steve Loughran


On 17 Dec 2018, at 17:44, Zoltan Ivanfi <[hidden email]> wrote:


> I support the idea of adding similar metadata to other file formats as well, but I consider that to be a second step.

ORC has long had a timestamp format. If extra attributes are needed on a timestamp, as long as the default "no metadata" value isn't changed, then at the file level things should be OK. 

More problematic is what would happen to an existing app reading in timestamps and ignoring any extra attributes. That way lies trouble.

Talk to the format groups sooner rather than later.




Re: proposal for expanded & consistent timestamp types

Steve Loughran
OK, I've seen the document now. Probably the best summary of timestamps out there I've ever seen.

Irrespective of what historical stuff has done, the goal should be "make everything consistent enough that cut-and-paste SQL queries over the same data work" and "you shouldn't have to care about the persistence format *or which app created the data*".

What does Arrow do in this world, incidentally?



Re: proposal for expanded & consistent timestamp types

Zoltan Ivanfi-2
Hi,

> ORC has long had a timestamp format. If extra attributes are needed on a timestamp, as long as the default "no metadata" value isn't changed, then at the file level things should be OK.
>
> more problematic is: what would happen to an existing app reading in timestamps and ignoring any extra attributes. That way lies trouble

Maybe it would be best if the freshly introduced, more explicit types were not forwards-compatible. To be more precise, it would be enough if only the "new" semantics were not forwards-compatible; it is fine if older readers can read the "already existing" semantics, since that is what they expect. Of course, this more fine-grained control is only possible if there is a single "already existing" semantics. Whether that's the case or not depends on the file format as well.

> Talk to the format groups sooner rather than later

Thanks for the suggestion, I will write a small summary from that perspective soon and contact the file format groups. I have Avro, Parquet and ORC in mind. Any other file format group I should contact? I plan to reach out to Arrow and Kudu as well. (Although strictly speaking these are not file formats, they have their own type systems too.)

> What does Arrow do in this world, incidentally?

Arrow has a few more options than just UTC-normalized or timezone-agnostic. It supports arbitrary time zones as well:

/// The time zone is a string indicating the name of a time zone [...]
///
/// * If the time zone is null or equal to an empty string, the data is "time
/// zone naive" and shall be displayed *as is* to the user, not localized
/// to the locale of the user. [...]
///
/// * If the time zone is set to a valid value, values can be displayed as
/// "localized" to that time zone, even though the underlying 64-bit
/// integers are identical to the same data stored in UTC. [...]

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L162
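
As a small sketch of the same idea in the Arrow Java API (the field name is made up and this is only meant as an illustration), the distinction shows up as the timezone parameter of the timestamp type:

import org.apache.arrow.vector.types.TimeUnit
import org.apache.arrow.vector.types.pojo.{ArrowType, Field}

// timezone = null  -> "time zone naive", i.e. LocalDateTime semantics
val naiveType = new ArrowType.Timestamp(TimeUnit.MICROSECOND, null)

// timezone = "UTC" -> UTC-normalized values, i.e. Instant semantics
val utcType = new ArrowType.Timestamp(TimeUnit.MICROSECOND, "UTC")

// any valid zone name -> the same underlying instants, displayed localized
val zonedType = new ArrowType.Timestamp(TimeUnit.MICROSECOND, "America/New_York")

val tsField = Field.nullable("ts", utcType)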

Br,

Zoltan


