Re: Should python-2 be supported in Spark 3.0?


Re: Should python-2 be supported in Spark 3.0?

Nicholas Chammas
As Reynold pointed out, we don't have to drop Python 2 support right off the bat. We can deprecate it in Spark 3.0, which would allow us to actually drop it in a later 3.x release.

On Sat, Sep 15, 2018 at 2:09 PM Erik Erlandson <[hidden email]> wrote:
On a separate dev@spark thread, I raised the question of whether or not to support Python 2 in Apache Spark going forward into Spark 3.0.

Python 2 goes EOL at the end of 2019. The upcoming Spark 3.0 release is an opportunity to make breaking changes to Spark's APIs, so it is a good time to reconsider Python 2 support in PySpark.

The key advantages of dropping Python 2 are:
  • Supporting PySpark becomes significantly easier.
  • We avoid having to support Python 2 until Spark 4.0, which would likely mean supporting Python 2 for some time after it goes EOL.
(Note that supporting Python 2 after EOL means, among other things, that PySpark would be supporting a version of Python that no longer receives security patches.)

The main disadvantage is that PySpark users with legacy Python 2 code would have to migrate it to Python 3 to take advantage of Spark 3.0.

This decision obviously has large implications for the Apache Spark community and we want to solicit community feedback.


Re: Should python-2 be supported in Spark 3.0?

Mark Hamstra
If we're going to do that, then we need to do it right now, since 2.4.0 is already in release candidates.

On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson <[hidden email]> wrote:
I like Mark’s concept of deprecating Py2 starting with 2.4: it may seem like a ways off, but even now some Spark versions may end up supporting Py2 past the point where Py2 is no longer receiving security patches.


On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra <[hidden email]> wrote:
We could also deprecate Py2 as early as the 2.4.0 release.

On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson <[hidden email]> wrote:
In case this didn't make it onto this thread:

There is a third option, which is to deprecate Py2 in Spark 3.0 and remove it entirely in a later 3.x release.




Re: Should python-2 be supported in Spark 3.0?

rxin
i'd like to second that.

if we want to communicate timeline, we can add to the release notes saying py2 will be deprecated in 3.0, and removed in a 3.x release.

--
excuse the brevity and lower case due to wrist injury


On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia <[hidden email]> wrote:
That’s a good point — I’d say there’s just a risk of creating a perception issue. First, some users might feel that this means they have to migrate now, which is before Python itself drops support; they might also be surprised that we did this in a minor release (e.g. might we drop Python 2 altogether in a Spark 2.5 if that later comes out?). Second, contributors might feel that this means new features no longer have to work with Python 2, which would be confusing. Maybe it’s OK on both fronts, but it just seems scarier for users to do this now if we do plan to have Spark 3.0 in the next 6 months anyway.

Matei

> On Sep 17, 2018, at 1:04 PM, Mark Hamstra <[hidden email]> wrote:
>
> What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't change the code at all; it's just a notification that we will eventually cease supporting Py2. Wouldn't users prefer to get that notification sooner rather than later?
>
> On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia <[hidden email]> wrote:
> I’d like to understand the maintenance burden of Python 2 before deprecating it. Since it is not EOL yet, it might make sense to only deprecate it once it’s EOL (which is still over a year from now). Supporting Python 2+3 seems less burdensome than supporting, say, multiple Scala versions in the same codebase, so what are we losing out on?
>
> The other thing is that even though Python core devs might not support 2.x later, it’s quite possible that various Linux distros will, if moving from 2 to 3 remains painful. In that case, we may want Apache Spark to continue releasing for it even though the Python core devs do not support it.
>
> Basically, I’d suggest deprecating this in Spark 3.0 and then removing it later in 3.x instead of deprecating it in 2.4. I’d also consider looking at what other data science tools are doing before fully removing it: for example, if Pandas and TensorFlow no longer support Python 2 past some point, that might be a good point to remove it.
>
> Matei


Re: Should python-2 be supported in Spark 3.0?

Erik Erlandson

I think that makes sense. The main benefit of deprecating *prior* to 3.0 would be informational: it makes the community aware of the upcoming transition earlier. But there are other ways to start informing the community between now and 3.0, besides formal deprecation.

I have some residual curiosity about what it might mean for a release like 2.4 to still be in its support lifetime after Py2 goes EOL. I asked Apache Legal to comment. It is possible there are no issues with this at all.





Re: Should python-2 be supported in Spark 3.0?

Xiangrui Meng
Hi all,

I want to revive this old thread since no action has been taken so far. If we plan to mark Python 2 as deprecated in Spark 3.0, we should do it as early as possible and let users know ahead of time. PySpark depends on Python, numpy, pandas, and pyarrow, all of which are sunsetting Python 2 support by 2020/01/01 per https://python3statement.org/. After that date we cannot really support Python 2, because the libraries we depend on do not plan to make new releases, even for security fixes. So I suggest the following:

1. Update the Spark website to state that Python 2 is deprecated in Spark 3.0 and that its support will be removed in a release after 2020/01/01.
2. Make a formal announcement to dev@ and users@.
3. Add the Apache Spark project to the https://python3statement.org/ timeline.
4. Update PySpark to check the Python version and print a deprecation warning if the version is < 3 (a minimal sketch of such a check follows below).
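
For item 4, a minimal sketch of what that check could look like (the wording and placement here are hypothetical, not actual PySpark code):

    import sys
    import warnings

    # Emit a deprecation warning when PySpark starts under Python 2.
    if sys.version_info[0] < 3:
        warnings.warn(
            "Python 2 support is deprecated as of Spark 3.0 and will be "
            "removed in a future release. Please migrate to Python 3.",
            DeprecationWarning)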

Any thoughts and suggestions?

Best,
Xiangrui




Re: Should python-2 be supported in Spark 3.0?

shane knapp
> I don't have a good sense of the overhead of continuing to support Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue supporting python2.7 for spark 2.x, as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all of the ansible and anaconda configs for python2.x.  :)

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: Should python-2 be supported in Spark 3.0?

rxin
+1 on Xiangrui’s plan.


Re: Should python-2 be supported in Spark 3.0?

Bryan Cutler
+1 and the draft sounds good

On Thu, May 30, 2019, 11:32 AM Xiangrui Meng <[hidden email]> wrote:
Here is the draft announcement:

===
Plan for dropping Python 2 support

As many of you already know, the Python core development team and many widely used Python packages such as Pandas and NumPy will drop Python 2 support on or before 2020/01/01. Apache Spark has supported both Python 2 and 3 since the Spark 1.4 release in 2015. However, maintaining Python 2/3 compatibility is an increasing burden, and it essentially limits the use of Python 3 features in Spark. Given that the end of life (EOL) of Python 2 is coming, we plan to eventually drop Python 2 support as well. The current plan is as follows:

* In the next major release in 2019, we will deprecate Python 2 support. PySpark users will see a deprecation warning if Python 2 is used. We will publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release in 2020, after Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
* For releases that already support Python 2, e.g., Spark 2.4, patch releases will continue to support Python 2. However, after Python 2 EOL, we might not take patches that are specific to Python 2.
===

Sean helped make a pass over it. If it looks good, I'm going to post it to the Spark website and announce it here. Let me know if you think we should do a VOTE instead.
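
For the removal stage described in the draft above, the corresponding hard check might look something like this (again a hypothetical sketch; the actual error message and its location would be decided later):

    import sys

    # After Python 2 support is removed, fail fast instead of warning.
    if sys.version_info[0] < 3:
        raise RuntimeError(
            "Python 2 is no longer supported by PySpark; "
            "please upgrade to Python 3.")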

On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng <[hidden email]> wrote:

On Thu, May 30, 2019 at 2:18 AM Felix Cheung <[hidden email]> wrote:
We don’t usually reference a future release on the website.

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when Spark 3.0 is coming out. We might need to provide some clarity on that.

We can say "the next major release in 2019" instead of Spark 3.0. The Spark 3.0 timeline certainly requires a new thread to discuss.

On Thu, May 30, 2019 at 12:59 AM Reynold Xin <[hidden email]> wrote:
+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp <[hidden email]> wrote:
> I don't have a good sense of the overhead of continuing to support Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue supporting python2.7 for spark 2.x, as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all of the ansible and anaconda configs for python2.x.  :)

On the development side, in a future release that drops Python 2 support, we can remove the code that maintains Python 2/3 compatibility and start using Python 3-only features, which is also quite exciting.
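
As an illustration of the kind of cleanup that enables (a hypothetical sketch, not actual PySpark code), compatibility shims like this could be deleted in favor of Python 3-only idioms:

    import sys

    # A Py2/3 compatibility shim that becomes unnecessary:
    if sys.version_info[0] < 3:
        string_types = (str, unicode)  # noqa: F821 (Python 2 branch only)
    else:
        string_types = (str,)

    # Python 3-only features that become usable:
    def describe(name: str, count: int) -> str:  # type annotations
        return f"{name}: {count}"                # f-strings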
 

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: Should python-2 be supported in Spark 3.0?

shane knapp
+1000  ;)

On Sat, Jun 1, 2019 at 6:53 AM Denny Lee <[hidden email]> wrote:
+1

On Fri, May 31, 2019 at 17:58 Holden Karau <[hidden email]> wrote:
+1

On Fri, May 31, 2019 at 5:41 PM Bryan Cutler <[hidden email]> wrote:
+1 and the draft sounds good


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: Should python-2 be supported in Spark 3.0?

Xiangrui Meng
I updated the Spark website and announced the plan for dropping Python 2 support there: http://spark.apache.org/news/plan-for-dropping-python-2-support.html. I will send an announcement email to user@ and dev@. -Xiangrui

On Fri, May 31, 2019 at 10:54 PM Felix Cheung <[hidden email]> wrote:
Very subtle, but someone might take

“We will drop Python 2 support in a future release in 2020”

to mean any / the first release in 2020, whereas the next statement indicates that patch releases are not included in the above. It might help to reorder the items or clarify the wording.


