proposal for expanded & consistent timestamp types

5 messages

proposal for expanded & consistent timestamp types

Imran Rashid-4
Hi,

I'd like to discuss the future of timestamp support in Spark, in particular with respect to handling time zones in the different SQL types. In a nutshell:

* There are at least three different ways of handling the timestamp type across time-zone changes.
* We'd like Spark to clearly distinguish all three types (it currently implements one of them), in a way that is backwards compatible and compliant with the SQL standard.
* We'd like to reach agreement across Spark, Hive, and Impala.
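To make the three behaviors concrete, here is a small sketch using only the Python standard library (the zone offsets and the sample reading are made up for illustration; the type names Instant, LocalDateTime, and OffsetDateTime follow the doc's Java-based terminology):

```python
from datetime import datetime, timedelta, timezone

# A hypothetical reading "2018-01-01 12:00" written in UTC+2, viewed by a
# reader whose session time zone is UTC-5.

# 1. Instant semantics (Spark's current TIMESTAMP): the value is a fixed
#    point on the global timeline; a reader in another zone sees the same
#    instant rendered as a different wall-clock time.
instant = datetime(2018, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
as_seen_by_reader = instant.astimezone(timezone(timedelta(hours=-5)))
assert instant == as_seen_by_reader   # same instant on the timeline...
assert as_seen_by_reader.hour == 5    # ...different wall-clock rendering

# 2. LocalDateTime semantics: the wall-clock fields themselves are the
#    value; no zone is attached, so every reader sees "12:00" unchanged.
local = datetime(2018, 1, 1, 12, 0)   # a naive datetime
assert local.tzinfo is None and local.hour == 12

# 3. OffsetDateTime semantics: wall-clock fields plus an explicit offset
#    are both preserved, so the writer's zone context is not lost.
offset = datetime(2018, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
assert offset.utcoffset() == timedelta(hours=2)
```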

Zoltan Ivanfi (Parquet PMC, also my coworker) has written up a detailed doc describing the problem in more detail, the state of various SQL engines, and how we can get to a better state without breaking any current use cases. The proposal is good for Spark by itself. We're also going to the Hive & Impala communities with this proposal, as it's better for everyone if everything is compatible.

Note that this isn't proposing a specific implementation in Spark yet, just a description of the overall problem and our end goal. We're going to each community to get agreement on the overall direction; then each community can figure out the specifics as they see fit. (I don't see any technical hurdles with this approach, e.g. anything that would make it impossible to implement in Spark.)

Here's a link to the doc Zoltan has put together.  It is a bit long, but it explains how such a seemingly simple concept has become such a mess and how we can get to a better state.


Please review the proposal and let us know your opinions, concerns and suggestions.

thanks,
Imran

Re: proposal for expanded & consistent timestamp types

Li Jin
Imran,

Thanks for sharing this. When working on interop between Spark and Pandas/Arrow in the past, we also faced some issues due to the different definitions of timestamp in Spark and Pandas/Arrow, because Spark timestamp has Instant semantics and Pandas/Arrow timestamp has either LocalDateTime or OffsetDateTime semantics. (Detailed discussion is in the PR: https://github.com/apache/spark/pull/18664#issuecomment-316554156.)

For one, I am excited to see this effort going, but I would also love to see Python interop included/considered in the picture. I don't think it adds much to what has already been proposed, because Python timestamps are basically LocalDateTime or OffsetDateTime.
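The interop pitfall described in that PR can be sketched with the standard library alone (the function name and the conversion shape below are illustrative, not Spark's actual API):

```python
from datetime import datetime, timedelta, timezone

# Spark's TIMESTAMP is an instant (microseconds since the epoch), while a
# pandas/Arrow column is typically naive (LocalDateTime semantics). Turning
# an instant into naive wall-clock fields requires picking a zone, so the
# same Spark value yields different naive values in different sessions.

epoch_micros = 1_514_808_000_000_000  # 2018-01-01 12:00:00 UTC as an instant

def to_naive(micros: int, session_tz: timezone) -> datetime:
    """Render an instant as a zone-less (naive) datetime, as a
    toPandas()-style conversion must do."""
    return datetime.fromtimestamp(micros / 1_000_000,
                                  tz=session_tz).replace(tzinfo=None)

ny = to_naive(epoch_micros, timezone(timedelta(hours=-5)))     # UTC-5 session
tokyo = to_naive(epoch_micros, timezone(timedelta(hours=9)))   # UTC+9 session
assert ny != tokyo  # one instant, two different naive renderings
```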

Li



On Thu, Dec 6, 2018 at 11:03 AM Imran Rashid <[hidden email]> wrote:

Re: proposal for expanded & consistent timestamp types

Maxim Gekk
Hello Imran,

Thank you for bringing this problem up. I have faced the issue of handling timestamps and dates when I implemented date/timestamp parsing in the CSV/JSON data sources:

Maxim Gekk

Technical Solutions Lead

Databricks B. V. 



On Fri, Dec 7, 2018 at 8:33 PM Li Jin <[hidden email]> wrote:

Re: proposal for expanded & consistent timestamp types

Imran Rashid-2
In reply to this post by Li Jin
Hi Li,

thanks for the comments!  I admit I had not thought very much about Python support; it's a good point.  But I'd actually like to clarify one thing about the doc: though it discusses Java types, the point is actually about having support for these logical types at the SQL level.  The doc uses Java names instead of SQL names just because there is so much confusion around the SQL names, as they haven't been implemented consistently.  Once there is support for the additional logical types, we'd absolutely want to get the same support in Python.

It's great to hear there are existing Python types we can map each behavior to.  Could you add a comment on the doc for each of the types, mentioning the Python equivalent?

thanks,
Imran

On Fri, Dec 7, 2018 at 1:33 PM Li Jin <[hidden email]> wrote:

Re: proposal for expanded & consistent timestamp types

Li Jin
Of course. I added some comments in the doc. 

On Tue, Dec 11, 2018 at 12:01 PM Imran Rashid <[hidden email]> wrote: