[discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

[discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Takuya UESHIN
Hi all,

I'd like to raise a discussion about the Pandas version.
We originally discussed this at https://github.com/apache/spark/pull/19607, but we'd like to ask the community for feedback.


Currently we don't explicitly specify which Pandas versions we support, but we need to decide on a minimum supported version because:

  - There have been a number of API evolutions around extension dtypes that make supporting pandas 0.18.x and lower challenging.

  - Pandas older than 0.19.2 sometimes doesn't handle timestamp values properly, and we want to provide proper support for them.

  - If users want to use vectorized UDFs, or the Arrow-based toPandas / createDataFrame from a Pandas DataFrame coming in Spark 2.3, they have to upgrade to Pandas 0.19.2 or later anyway, because internally we need pyarrow, which only supports 0.19.2 or later (see the sketch after this list).
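
For illustration, here is a minimal sketch (not Spark's actual code) of the kind of version gate this implies; the helper name, constant, and message wording are placeholders:

```python
from distutils.version import LooseVersion

# Hypothetical minimum version implied by the discussion above.
_MINIMUM_PANDAS_VERSION = "0.19.2"

def require_minimum_pandas_version():
    """Raise an informative error if the installed Pandas is too old."""
    import pandas
    if LooseVersion(pandas.__version__) < LooseVersion(_MINIMUM_PANDAS_VERSION):
        raise ImportError(
            "Pandas >= %s must be installed for this feature; however, "
            "your version is %s." % (_MINIMUM_PANDAS_VERSION, pandas.__version__))
```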


The questions I'd like to ask are:

Can we drop support for old Pandas (<0.19.2)?
If not, what version should we support?


References (a short usage sketch follows the list):

- vectorized UDF
- toPandas with Arrow
- createDataFrame from pandas DataFrame with Arrow
  - https://github.com/apache/spark/pull/19646
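
To show what these interops look like in user code, here is a short sketch assuming the APIs land in Spark 2.3 as proposed in the PRs above; the exact names (pandas_udf, the spark.sql.execution.arrow.enabled conf) follow those proposals and may change before release:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Enable the Arrow code path for toPandas / createDataFrame,
# as proposed for Spark 2.3.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# createDataFrame from a Pandas DataFrame, converted via Arrow.
df = spark.createDataFrame(pd.DataFrame({"v": [1.0, 2.0, 3.0]}))

# Vectorized UDF: receives a pandas.Series per batch instead of one
# row at a time.
@pandas_udf("double")
def plus_one(v):
    return v + 1

# toPandas, also converted via Arrow.
result = df.select(plus_one(df["v"])).toPandas()
```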


Any comments are welcome!

Thanks.

--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Hyukjin Kwon
+0 to drop it, as I said in the PR. I'm seeing that it makes it hard to get the cool changes through and slows down getting them pushed.

My only worry is users who depend on lower Pandas versions (Pandas 0.19.2 seems to have been released less than a year ago, around the same time as Spark 2.1.0).

If this worry is smaller than I expect, I definitely support it. It should speed up those cool changes.


Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Li Jin
I think this makes sense. The PySpark/Pandas interops in 2.3 are new anyway; I don't think we need to support the new functionality with older versions of Pandas (Takuya's reason 3).

One thing I am not sure about is how complicated it would be to support pandas < 0.19.2 for the old non-Arrow interops while requiring pandas >= 0.19.2 for the new Arrow interops. Maybe it makes sense to let users keep running their existing PySpark code if they don't want any of the new stuff. If that is still too complicated, I would lean towards not supporting < 0.19.2.


Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Takuya UESHIN
Thanks for the feedback.

Hyukjin Kwon:
> My only worry is users who depend on lower Pandas versions

That's what I was worried about too, and one of the reasons I moved this discussion here.

Li Jin:
> how complicated it would be to support pandas < 0.19.2 for the old non-Arrow interops

In my original PR (https://github.com/apache/spark/pull/19607), we fix the behavior of timestamp values for Pandas.
If we need to support old Pandas, we will need at least some workarounds like those in the following link:
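
Roughly, the kind of workaround in question looks like this hypothetical helper (not the PR's actual code) for the session-timezone conversion that toPandas needs; Pandas >= 0.19.2 handles this reliably via the Series.dt accessors:

```python
import pandas as pd

def to_session_local_time(series, timezone):
    # Hypothetical helper: interpret tz-naive values as UTC, convert to
    # the session time zone, then drop the tz info again so the dtype
    # stays datetime64[ns].
    return (series.dt.tz_localize("UTC")
                  .dt.tz_convert(timezone)
                  .dt.tz_localize(None))
```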


Thanks.


--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Hyukjin Kwon
FWIW, while looking around at things related to this discussion, I tried to make a line chart of the "adoption rate of new updates/releases" of Pandas on PyPI (2016-01 ~ 2017-11), after manually filtering out small numbers to keep the chart clean:


[image: line chart of Pandas download counts per version over time]

The X axis is yyyymm, the Y axis is the total download count, and the colour indicates the version.
Here is the CSV: https://drive.google.com/file/d/1TKbAFehjMLKb2LixlnNuyCDBt3aW8ZIg/view?usp=sharing
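
For reference, a sketch of how such a chart could be rebuilt from that CSV; the file name and the column names ("yyyymm", "version", "downloads") are assumptions about its layout, not verified:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one row per (month, version) pair with a download count.
df = pd.read_csv("pandas_pypi_downloads.csv")  # hypothetical file name
pivot = df.pivot_table(index="yyyymm", columns="version",
                       values="downloads", aggfunc="sum")
pivot.plot()  # one line per version
plt.xlabel("yyyymm")
plt.ylabel("total download counts")
plt.show()
```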



Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Holden Karau
So this would be biased towards newer versions, since I imagine older versions are mostly system-distributed rather than newly installed.

Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Hyukjin Kwon
Hi dev, any more thoughts on this?

Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Abdeali Kothari
In my opinion, requiring a higher Pandas version is OK, as upgrading Pandas is frequently much easier than upgrading Spark. (The oldest OS I use regularly is RHEL 6.)

Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

rxin
Seems OK to me, provided that we give a good error message when the wrong version is installed (e.g. the error message can't be something like "function x doesn't exist in NodeType").

Re: [discuss][PySpark] Can we drop support for old Pandas (<0.19.2), or what version should we support?

Takuya UESHIN
Thanks for your feedback.

It seems we don't have any explicit objections.
I'll update the PR to remove the workarounds for old Pandas and add error messages saying we need 0.19.2 or later.

Thanks all again!


--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin