[DISCUSS] Increasing minimum supported version of Pandas

Bryan Cutler
Hi All,

We would like to discuss increasing the minimum supported version of Pandas in Spark, which is currently 0.19.2.

Pandas 0.19.2 was released nearly three years ago, and there are workarounds in PySpark that could be removed if such an old version were no longer supported. This would help keep the code clean and reduce maintenance effort.
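
As a hypothetical illustration (not actual PySpark code), these workarounds are typically version-gated branches like the one below, and raising the minimum would let the fallback arm be deleted. DataFrame.infer_objects(), for example, only appeared in pandas 0.21, so code that must also run on 0.19.2 needs a guard:

    from distutils.version import LooseVersion
    import pandas as pd

    def normalize_object_columns(pdf):
        # infer_objects() exists since pandas 0.21
        if LooseVersion(pd.__version__) >= LooseVersion("0.21.0"):
            return pdf.infer_objects()
        # best-effort fallback for pre-0.21 pandas
        return pdf.apply(lambda col: pd.to_numeric(col, errors="ignore"))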

The change is targeted for the Spark 3.0.0 release; see https://issues.apache.org/jira/browse/SPARK-28041. The current thought is to bump the minimum to 0.23.2, but we would like to discuss it before making the change. Does anyone else have thoughts on this?

Regards,
Bryan

Re: [DISCUSS] Increasing minimum supported version of Pandas

Hyukjin Kwon
I am +1 to go for 0.23.2 - the old minimum brings some overhead in testing PyArrow and pandas combinations, and Spark 3 should be a good time to increase it.


Re: [DISCUSS] Increasing minimum supported version of Pandas

Holden Karau
I’m +1 for upgrading, although since this is probably the last easy chance we’ll have to bump the version, I’d suggest 0.24.2.


--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: [DISCUSS] Increasing minimum supported version of Pandas

Dongjoon Hyun
+1

Thank you for this effort, Bryan!

Bests,
Dongjoon.


Re: [DISCUSS] Increasing minimum supported version of Pandas

shane knapp
just so everyone knows, our python 3.6 testing infra is currently on 0.24.2...


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: [DISCUSS] Increasing minimum supported version of Pandas

Bryan Cutler
I should have stated this earlier, but when the user does something that requires Pandas, the imported version is checked against the minimum supported version, and an exception is raised if it is lower. So I'm concerned that 0.24.2 might be a little too new for users running older clusters. For some release dates: 0.23.2 came out about a year ago, 0.24.0 in January, and 0.24.2 in March.
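
For reference, here is a minimal sketch of that lazy check (PySpark's real implementation is require_minimum_pandas_version() in pyspark/sql/utils.py; the constant and message text below are illustrative):

    from distutils.version import LooseVersion

    MINIMUM_PANDAS_VERSION = "0.23.2"  # the proposed new minimum

    def require_minimum_pandas_version():
        # Called only when a Pandas-dependent feature is actually used, so
        # plain PySpark jobs do not need pandas installed at all.
        try:
            import pandas
        except ImportError:
            raise ImportError("Pandas >= %s must be installed; it was not found."
                              % MINIMUM_PANDAS_VERSION)
        if LooseVersion(pandas.__version__) < LooseVersion(MINIMUM_PANDAS_VERSION):
            raise ImportError("Pandas >= %s must be installed; your version is %s."
                              % (MINIMUM_PANDAS_VERSION, pandas.__version__))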


Re: [DISCUSS] Increasing minimum supported version of Pandas

shane knapp
ah, ok...  should we downgrade the testing env on jenkins then?  any specific version?

shane, who is loath (and i mean LOATH) to touch python envs ;)


Re: [DISCUSS] Increasing minimum supported version of Pandas

Bryan Cutler
Shane, I think 0.24.2 is probably more common right now, so if we were to pick one version to test against, I still think it should be that one. Our Pandas usage in PySpark is pretty conservative, so it's unlikely that we will add something that would break 0.23.x.


Re: [DISCUSS] Increasing minimum supported version of Pandas

shane knapp
excellent.  i shall not touch anything.  :)


Re: [DISCUSS] Increasing minimum supported version of Pandas

Felix Cheung
So to be clear: the minimum version check would be 0.23.2, and the Jenkins test environment is on 0.24.2.

I’m ok with this. I hope someone will test 0.23 on releases, though, before we sign off? We should maybe add this to the release instruction notes?



Re: [DISCUSS] Increasing minimum supported version of Pandas

Holden Karau
Are there other Python dependencies we should consider upgrading at the same time?

I think, given that we’re switching to requiring Python 3 and we’re still a bit of a way from cutting a release, 0.24 could be OK as a minimum version requirement.



Re: [DISCUSS] Increasing minimum supported version of Pandas

Felix Cheung
How about PyArrow?



Re: [DISCUSS] Increasing minimum supported version of Pandas

Bryan Cutler
Yeah, PyArrow is the only other PySpark dependency we check for a minimum version. We updated that minimum not too long ago to 0.12.1, and I think we are still good there for now.
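
For illustration, a sketch of how both lazy checks would gate an Arrow-backed path (the check names follow pyspark.sql.utils at the time of writing; the wrapper itself is hypothetical):

    from pyspark.sql.utils import (
        require_minimum_pandas_version,
        require_minimum_pyarrow_version,
    )

    def to_pandas_via_arrow(spark_df):
        # Each check raises ImportError lazily, only when this path is hit.
        require_minimum_pandas_version()   # pandas >= the chosen minimum
        require_minimum_pyarrow_version()  # pyarrow >= 0.12.1
        return spark_df.toPandas()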


Re: [DISCUSS] Increasing minimum supported version of Pandas

Hyukjin Kwon
Oh btw, why is it 0.23.2, not 0.23.0 or 0.23.4?


Re: [DISCUSS] Increasing minimum supported version of Pandas

shane knapp
we're currently testing against pyarrow 0.12.1.
