Revisiting the idea of a Spark 2.5 transitional release


Holden Karau
Hi Folks,

As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5 release. Spark 3 brings a number of important changes, and by its nature is not backward compatible. I think we'd all like the upgrade experience to Spark 3 to be as smooth as possible, and I believe that having a Spark 2 release with some of the new functionality, while continuing to support the older APIs and current Scala version, would make the upgrade path smoother.

This pattern is not uncommon in other Hadoop ecosystem projects, like Hadoop itself and HBase.

I know that Ryan Blue has indicated he is already going to be maintaining something like that internally at Netflix, and we'll be doing the same thing at Apple. It seems like having a transitional release could benefit the community by easing migrations and avoiding duplicated work.

I want to be clear that I'm volunteering to do the work of managing a 2.5 release, so hopefully this wouldn't create any substantial burden on the community.

Cheers,

Holden
--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: Revisiting the idea of a Spark 2.5 transitional release

Xiao Li-2
Which new functionalities are you referring to? In Spark SQL, most of the major features in Spark 3.0 are difficult and time-consuming to backport - for example, adaptive query execution. Releasing a new version is not hard, but backporting, reviewing, and maintaining these features is very time-consuming.

Which old APIs are broken? If the impact is big, we should add them back, per our earlier discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html

Thanks,

Xiao



Re: Revisiting the idea of a Spark 2.5 transitional release

Sean Owen-2
What is the functionality that would go into a 2.5.0 release, that can't be in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x maintenance branch, and I personally could imagine being open to more freely backporting a few new features for 2.x users, whereas usually it's only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance branch but there's something too big for a 'normal' maintenance release, and I think the whole question turns on what that is.

If it's things like JDK 11 support, I think that is unfortunately fairly 'breaking' because of dependency updates. But maybe that's not it.



Re: Revisiting the idea of a Spark 2.5 transitional release

Holden Karau
So one of the things we're planning on backporting internally is DSv2, which I think would be more broadly useful if it were available in a community release on a 2.x branch. Anything else on top of that would be considered case by case, based on whether it makes the upgrade path to 3 easier.

If we’re worried about people using 2.5 as a long term home we could always mark it with “-transitional” or something similar?

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: Revisiting the idea of a Spark 2.5 transitional release

Xiao Li-2
Based on my understanding, DSv2 is not stable yet. It is still missing various features; even our built-in file sources are still unable to fully migrate to DSv2. We plan to enhance it in the next few releases to close the gap.

Also, the changes to DSv2 in Spark 3.0 did not break any existing applications. We should encourage more users to try Spark 3 and increase the adoption of Spark 3.x.

Xiao 


Re: Revisiting the idea of a Spark 2.5 transitional release

Ryan Blue
+1 for a 2.x release with a DSv2 API that matches 3.0.

There are a lot of big differences between the API in 2.4 and 3.0, and I think a release to help migrate would be beneficial to organizations like ours that will be supporting 2.x and 3.0 in parallel for quite a while. Migration to Spark 3 is going to take time as people build confidence in it. I don't think that can be avoided by leaving a larger feature gap between 2.x and 3.0.
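
For readers who haven't followed the API change, here is a minimal sketch of the gap between the two DSv2 read paths. The MinimalSource* classes are hypothetical, and each half compiles only against its own Spark version:

    // Spark 2.4 DSv2: a source implements DataSourceV2 plus capability mix-ins,
    // and hands Spark a DataSourceReader directly.
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.DataSourceReader

    class MinimalSource24 extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader = ???
    }

    // Spark 3.0 DSv2: the source is a TableProvider that returns a Table, which
    // in turn exposes scan/write capabilities - a structurally different shape.
    import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class MinimalSource30 extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: java.util.Map[String, String]): Table = ???
    }

A 2.5 release with the 3.0 API would let sources be written once against the new shape and reused on both lines.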

--
Ryan Blue
Software Engineer
Netflix

Re: Revisiting the idea of a Spark 2.5 transitional release

Jungtaek Lim-2
I guess we already went through the same discussion, right? If anyone missed it, please go through that discussion thread. [1] The consensus did not look positive on migrating the new DSv2 to the Spark 2.x line, because the change is quite large and also backward incompatible.

The benefit I can see in having a Spark 2.5 is avoiding a forced upgrade to the major release to get fixes for critical bugs. Not all critical fixes landed in 2.x, because some fixes bring backward incompatibility. We didn't land those fixes in the 2.x line because we weren't considering a Spark 2.5 - we don't want end users to have to tolerate that inconvenience when upgrading a bugfix version. End users may be OK tolerating it when upgrading a minor version, since they can still stay on 2.4.x if they want to avoid those fixes.

In addition, given the huge time gap between Spark 2.4 and 3.0, we might want to consider porting some of the features that don't bring backward incompatibility. New major features would probably be better introduced in Spark 3.0, but some could be ported, especially if a feature resolves a long-standing issue or has long been available in competing products.

Thanks,


Re: Revisiting the idea of a Spark 2.5 transitional release

DB Tsai-3
+1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support

We had an internal preview version of Spark 3.0 for our customers to try out for a while, and we realized that it's very challenging for enterprise applications in production to move to Spark 3.0. For example, many of our customers' Spark applications depend on internal projects that may not be owned by the ETL teams; it requires a lot of coordination with other teams to cross-build the dependencies those applications rely on with Scala 2.12 in order to use Spark 3.0. Now that we have removed Scala 2.11 support in Spark 3.0, there is a really big gap in migrating from 2.x to 3.0, based on my observation working with our customers.
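
To make the coordination cost concrete: a shared internal library has to cross-publish for both Scala lines before any job that uses it can move. A minimal build.sbt sketch, with a hypothetical artifact name:

    // Cross-building publishes foo-lib_2.11 and foo-lib_2.12 side by side, so
    // Spark 2.x (Scala 2.11) and Spark 3.0 (Scala 2.12) jobs can both resolve it.
    name := "foo-lib"
    crossScalaVersions := Seq("2.11.12", "2.12.10")
    // Spark 2.4 itself is published for both Scala versions, so %% resolves
    // the matching spark-sql artifact for whichever Scala version is active.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6" % Provided

Running sbt +publishLocal then builds and publishes the library once per listed Scala version - and that is the easy case, where one team owns the whole dependency chain.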

Also, JDK 8 is already EOL; in some companies, using JDK 8 is not supported by the infra team and requires an exception for running an unsupported JDK. Of course, those companies can use a vendor's Spark distribution, such as CDH Spark 2.4 which supports JDK 11, or maintain their own Spark release, which is possible but not trivial.

As a result, having a 2.5 release with DSv2, JDK 11, and Scala 2.11 support can definitely narrow the gap, and users can still move forward using new features. After all, the reason we work on OSS is that we want people to use our code, isn't it?

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1



Re: Revisiting the idea of a Spark 2.5 transitional release

rxin
I understand the argument to add JDK 11 support just to extend the EOL, but the other things seem kind of arbitrary and are not supported by your arguments, especially DSv2, which is a massive change. DSv2, IIUC, is not API-stable yet and will continue to evolve in the 3.x line.

Spark is designed in a way that’s decoupled from storage, and as a result one can run multiple versions of Spark in parallel during migration. 


Re: Revisiting the idea of a Spark 2.5 transitional release

Holden Karau
Can I suggest we maybe decouple this conversation a bit? First, see whether there is agreement in principle on making a transitional release, and then folks who feel strongly about specific backports can have their respective discussions. It's not like we normally know, or have agreement on, everything going into a release at the time we cut the branch.

On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin <[hidden email]> wrote:
> I understand the argument to add JDK 11 support just to extend the EOL, but the other things seem kind of arbitrary and are not supported by your arguments, especially DSv2, which is a massive change. DSv2, IIUC, is not API-stable yet and will continue to evolve in the 3.x line.
>
> Spark is designed in a way that's decoupled from storage, and as a result one can run multiple versions of Spark in parallel during migration.

At the job level, sure, but upgrading large jobs, possibly written in Scala 2.11, whole-hog as things currently stand is not a small matter.

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: Revisiting the idea of a Spark 2.5 transitional release

rxin
Echoing Sean's earlier comment … What is the functionality that would go into a 2.5.0 release, that can't be in a 2.4.7 release? 



Re: Revisiting the idea of a Spark 2.5 transitional release

DB Tsai-7
For example, JDK 11 requires dependency changes that cannot go into 2.4.7. Recent development on Kubernetes, such as Spark 3.0's support for dynamic allocation on Kubernetes (without the shuffle service), would also be hard to land in 2.4.7.
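
For reference, the Spark 3.0 behavior referred to here is driven by shuffle-file tracking on executors rather than the external shuffle service. A minimal sketch of the relevant settings (the executor cap is illustrative):

    import org.apache.spark.SparkConf

    // Dynamic allocation on Kubernetes in Spark 3.0: shuffle tracking lets Spark
    // release executors safely without an external shuffle service. The
    // shuffleTracking option is new in 3.0 and does not exist on the 2.4 line.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .set("spark.dynamicAllocation.maxExecutors", "20") // illustrative cap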

Sent from my iPhone


Re: Revisiting the idea of a Spark 2.5 transitional release

Sean Owen-2
These two are coupled, and in tension: we don't want to take much change, but we do want changes that will unfortunately be somewhat breaking. A 2.5 release with these items would be different enough to strain the general level of compatibility implied by a minor release. Sure, it's not 'just' a maintenance release, but de facto it becomes the maintenance branch for all of 2.x, so it kind of is. 2.4.x users then need to move to 2.5 too, as eventually it's the only 2.x maintenance branch. OK, you can maintain 2.4.x and 2.5.x until 2.x is EOL, but that increases the complexity: everything backported goes to two branches and has to work with both.

I don't know if there's a reason to cut 2.5.0 just on principle; it had seemed pretty clear to me with 3.0 that 2.4.x was simply the last 2.x release. We normally maintain version x and x+1, and will expand to maintain 2.x + 3.0.x + 3.1.x soon. So it does depend on what would go in it.

One person's breaking change is another person's just-fine enhancement, though. People wouldn't suggest it here unless they were in the latter group (though are we all talking about the same two major items?).
What I don't know is how that looks across the wider user base. Obviously, there are a few important votes in favor here. On the other hand, I haven't heard of significant issues in updating to 3.0 during the preview releases, which could suggest that users who want DSv2 et al. can just move to 3.0.

On the items: I don't know enough about DSv2 to say, but it seems like a big change to backport.
On JDK 11: I understand Java 8 is EOL as far as Oracle is concerned, but OpenJDK 8 is still being updated, and even Oracle supports it (for $). Anecdotally, I have not perceived this to be a significant issue inside or outside Spark.

Yes, this can also be where downstream vendors supply support for a specialized hybrid build. 

I'm not sure there's an objectively right call here, certainly without more than anecdotal or personal perspective on the tradeoffs. It still seems like the current plan is fine to me though, to leave these items in 3.0.

We can also wait-and-see. If after 3.0 is GA there is clearly wide demand for a transitional release, that could change the calculation.

