[Proposal] Modification to Spark's Semantic Versioning Policy


[Proposal] Modification to Spark's Semantic Versioning Policy

Michael Armbrust

Hello Everyone,


As more users have started upgrading to Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic versioning, so this major release is our chance to get it right [by breaking APIs]". Similarly, in many cases the response to questions about why an API was completely removed has been, "this API has been deprecated since x.x, so we have to remove it".


As a long-time contributor to and user of Spark, I find this interpretation of the policy concerning. This reasoning misses the intention of the original policy, and I am worried that it will hurt the long-term success of the project.


I definitely understand that these are hard decisions, and I'm not proposing that we never remove anything from Spark. However, I would like to give some additional context and also propose a different rubric for thinking about API breakage moving forward.


Spark adopted semantic versioning back in 2014 during the preparations for the 1.0 release. As this was the first major release -- and as, up until fairly recently, Spark had only been an academic project -- no real promises about API stability had ever been made.


During the discussion, some committers suggested that this was an opportunity to clean up cruft and give the Spark APIs a once-over, making cosmetic changes to improve consistency. However, in the end, it was decided that in many cases it was not in the best interests of the Spark community to break things just because we could. Matei actually said it pretty forcefully:


I know that some names are suboptimal, but I absolutely detest breaking APIs, config names, etc. I’ve seen it happen way too often in other projects (even things we depend on that are officially post-1.0, like Akka or Protobuf or Hadoop), and it’s very painful. I think that we as fairly cutting-edge users are okay with libraries occasionally changing, but many others will consider it a show-stopper. Given this, I think that any cosmetic change now, even though it might improve clarity slightly, is not worth the tradeoff in terms of creating an update barrier for existing users.


In the end, while some changes were made, most APIs remained the same and users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this served the project very well, as compatibility means users are able to upgrade and we keep as many people on the latest versions of Spark (though maybe not the latest APIs of Spark) as possible.


As Spark grows, I think compatibility actually becomes more important and we should be more conservative rather than less. Today, there are very likely more Spark programs running than at any other time in the past. Spark is no longer a tool used only by advanced hackers; it is now also running "traditional enterprise workloads." In many cases these jobs power important processes long after the original author leaves.


Broken APIs can also affect libraries that extend Spark. This dependency can be even harder on users: if a library they need has not been upgraded to the new APIs, they are stuck.


Given all of this, I'd like to propose the following rubric as an addition to our semantic versioning policy. After discussion and if people agree this is a good idea, I'll call a vote of the PMC to ratify its inclusion in the official policy.


Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.


Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:

  • Usage - an API that is actively used in many different places is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:

    • How long has the API been in Spark?

    • Is the API common even for basic programs?

    • How often do we see recent questions in JIRA or mailing lists?

    • How often does it appear in StackOverflow or blogs?

  • Behavior after the break - How will a program that works today work after the break? The following are listed roughly in order of increasing severity:

    • Will there be a compiler or linker error?

    • Will there be a runtime exception?

    • Will that exception happen after significant processing has been done?

    • Will we silently return different answers? (very hard to debug, might not even notice!)


Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.

  • Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project change. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc.). In some cases, while keeping a particular API is not technically infeasible, the cost of maintaining it can become too high.

  • User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.


Alternatives to Breaking an API

In cases where there is a "Bad API" but the cost of removal is also high, there are alternatives worth considering that do not hurt existing users but do address some of the maintenance costs.


  • Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.

  • Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated (see the sketch after this list).

  • Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.

  • Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Updating them reduces the cost of eventually removing deprecated APIs.
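
To make the deprecation-warning point concrete, here is a rough Scala sketch (the method names are hypothetical, not real Spark APIs); the important part is that the message names the replacement rather than merely announcing the deprecation:

    object ExampleApi {
      // A bare "this is deprecated" message leaves users guessing; naming the
      // replacement (and the migration path, if any) does not.
      @deprecated("Use loadTable(path, format) instead; loadData remains as an alias.", "3.0.0")
      def loadData(path: String): Unit = loadTable(path, "parquet")

      def loadTable(path: String, format: String): Unit =
        println(s"loading $path as $format")
    }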


Examples


Here are some examples of how I think the policy above could be applied to different issues that have been discussed recently. These are only to illustrate how to apply the above rubric, but are not intended to be part of the official policy.


[SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts #23311


  • Cost to Break - Multiple contexts in a single JVM never worked properly. When users tried it, they would nearly always report that Spark was broken (SPARK-2243), due to the confusing set of log messages. Given this, I think it is very unlikely that there are many real-world use cases active today. Even those cases likely suffer from undiagnosed issues, as there are many areas of Spark that assume a single context per JVM.

  • Cost to Maintain - We have recently had users ask on the mailing list whether this was supported, as the conf led them to believe it was. The existence of this configuration as "supported" also makes it harder to reason about certain global state in SparkContext.


Decision: Remove this configuration and related code.
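
For context, here is a rough sketch of what this escape hatch permitted, assuming a Spark 2.x build (shown only to illustrate the confusion, not as a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    object MultipleContextsSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[*]")
          .setAppName("first-context")
          .set("spark.driver.allowMultipleContexts", "true") // the config in question

        val sc1 = new SparkContext(conf)
        // With the flag set, Spark 2.x logs a warning here instead of failing fast,
        // even though much of the codebase assumes a single active context per JVM.
        val sc2 = new SparkContext(new SparkConf()
          .setMaster("local[*]")
          .setAppName("second-context")
          .set("spark.driver.allowMultipleContexts", "true"))

        sc2.stop()
        sc1.stop()
      }
    }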


[SPARK-25908] Remove registerTempTable #22921 (only looking at one API of this PR)


  • Cost to Break - This is a wildly popular API of Spark SQL that has been there since the first release. There are tons of blog posts and examples that use this syntax if you google "dataframe registerTempTable" (even more than for the "correct" API, "dataframe createOrReplaceTempView"). All of these will be invalid for users of Spark 3.0.

  • Cost to Maintain - This is just an alias, so there is not a lot of extra machinery required to keep the API. Users have two ways to do the same thing, but we can note in the docs that this is just an alias.


Decision: Do not remove this API; I would even consider un-deprecating it. I anecdotally asked several users and this is the API they prefer over the "correct" one.
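
For illustration, the alias point amounts to roughly the following (a paraphrased sketch, not the actual Dataset source; the extension-method name is hypothetical to avoid clashing with the real one):

    import org.apache.spark.sql.Dataset

    object LegacyViewSupport {
      implicit class LegacyViewOps[T](ds: Dataset[T]) {
        // The old name delegates to the new API in one line, and the deprecation
        // message names the replacement, so the maintenance cost is minimal.
        @deprecated("Use createOrReplaceTempView(viewName) instead", "2.0.0")
        def registerTempTableCompat(viewName: String): Unit =
          ds.createOrReplaceTempView(viewName)
      }
    }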

[SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195

  • Cost to Break - I think that this case actually exemplifies several anti-patterns in breaking APIs. In some languages, the deprecation warning gives you no help other than the version in which the function was deprecated. In R, it points users to a really deep conversation on the semantics of time in Spark SQL. None of the messages tell you how to correctly parse a timestamp that is given to you in a format other than UTC. My guess is all users will blindly flip the flag to true (to keep using these functions), so you've only succeeded in annoying them.

  • Cost to Maintain - These are two relatively isolated expressions; there should be little cost to keeping them. Users can be confused by their semantics, so we probably should update the docs to point them to a best practice (I learned only by complaining on the PR that a good practice is to parse timestamps with the timezone included in the format expression, which naturally shifts them to UTC).


Decision: Do not deprecate these two functions. We should update the docs to talk about best practices for parsing timestamps, including how to correctly shift them to UTC for storage.
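
As a concrete illustration of that practice, here is a small sketch as it might be typed into spark-shell (the column name and sample value are made up):

    import org.apache.spark.sql.functions.{col, to_timestamp}
    import spark.implicits._

    // Hypothetical input: timestamp strings that carry their own zone offset.
    val df = Seq("2020-02-24 15:03:00-08:00").toDF("raw_ts")

    // Because the pattern includes the offset field (XXX), the parsed value is an
    // absolute instant, stored internally relative to UTC, regardless of the
    // session timezone -- no from_utc_timestamp/to_utc_timestamp juggling needed.
    val parsed = df.withColumn("event_ts", to_timestamp(col("raw_ts"), "yyyy-MM-dd HH:mm:ssXXX"))
    parsed.show(truncate = false)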


[SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902


  • Cost to Break - The TRIM function takes two string parameters. If we switch the parameter order, queries that use the TRIM function would silently return different results on different versions of Spark. Users may not notice for a long time, and wrong query results can cause serious problems.

  • Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM function in the Scala API and in SQL will have different parameter orders.


Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate (with a warning, not by removing) the two-argument SQL TRIM function and move users to the SQL-standard TRIM syntax.
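
For reference, the SQL-standard form the decision promotes looks like this in spark-shell (the literal values are made up):

    // The standard syntax spells out which argument is the set of characters to
    // strip and which is the source string, so there is no positional order to get wrong.
    spark.sql("SELECT trim(BOTH 'x' FROM 'xxhixx') AS both_sides").show()
    spark.sql("SELECT trim(LEADING 'x' FROM 'xxhixx') AS leading_only").show()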


Thanks for taking the time to read this! Happy to discuss the specifics and amend this policy as the community sees fit.


Michael



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Xiao Li
+1 

Xiao




Re: [Proposal] Modification to Spark's Semantic Versioning Policy

John Zhuge
Well written, Michael!

Believe it or not, I read through the entire email, which is very rare for emails of such length. Happy to see healthy discussions on this tough subject. We definitely need perspectives from both the users and the contributors.



--
John Zhuge

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

mmb
In reply to this post by Michael Armbrust
+1

_________________________________________________________
Michel Miotto Barbosa, Data Science/Software Engineer
Learn MBA Global Financial Broker at IBMEC SP,
Learn Economic Science at PUC SP
MBA in Project Management, Graduate in Software Engineering
phone: +55 11 984 342 347,
@michelmb





Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Jules Damji-2
In reply to this post by Michael Armbrust
+1 

Well said! 

Sent from my iPhone
Pardon the dumb thumb typos :)




Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Sean Owen-3
In reply to this post by Michael Armbrust
Those are all quite reasonable guidelines and I'd put them into the
contributing or developer guide, sure.
Although not argued here, I think we should go further than codifying
and enforcing common-sense guidelines like these. I think bias should
shift in favor of retaining APIs going forward, and even retroactively
shift for 3.0 somewhat. (Hence some reverts currently in progress.)
It's a natural evolution from 1.x to 2.x to 3.x. The API surface area
stops expanding, changing, and getting fixed as much; years more
experience proves out which APIs make sense.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Tom Graves-2
In reply to this post by Michael Armbrust
In general, +1. I think these are good guidelines, and making it easier to upgrade is beneficial to everyone. The decision needs to happen at API/config change time; otherwise the deprecation warning has no purpose if we are never going to remove them.
That said, we still need to be able to remove deprecated things and change APIs in major releases, otherwise why do a major release in the first place? Is it purely to support newer Scala/Python/Java versions?

I think the hardest part listed here is judging what the impact is. Whose call is that? It's hard to know how everyone is using things, and I think it's been harder to get feedback on SPIPs and API changes in general as people are busy with other things. Like you mention, I think StackOverflow is unreliable; the posts could be many years old and no longer relevant.

Tom

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Michael Armbrust
Thanks for the discussion! A few responses:

The decision needs to happen at API/config change time; otherwise the deprecation warning has no purpose if we are never going to remove them.

Even if we never remove an API, I think deprecation warnings (when done right) can still serve a purpose. For new users, a deprecation can serve as a pointer to newer, faster APIs or ones with less sharp edges. I would be supportive of efforts that use them to clean up the docs. For example, we could hide deprecated APIs after some time so they don't clutter the Scaladoc/Javadoc. We can and should audit things like the user guide and our own examples to make sure they don't use deprecated APIs.
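
To make the "when done right" part concrete, here is a hypothetical sketch (this is not actual Spark source; the object and method names just mirror the example discussed earlier in the thread) of a deprecation that points the reader to the replacement instead of merely announcing removal:

    object ExampleDataset {
      def createOrReplaceTempView(viewName: String): Unit = {
        // ... real registration logic would live here ...
      }

      // The message names the replacement, so the warning doubles as documentation.
      @deprecated("Use createOrReplaceTempView(viewName) instead.", "2.0.0")
      def registerTempTable(viewName: String): Unit = createOrReplaceTempView(viewName)
    }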
 
That said, we still need to be able to remove deprecated things and change APIs in major releases, otherwise why do a major release in the first place? Is it purely to support newer Scala/Python/Java versions?

I don't think major versions are purely for Scala/Java/Python/Hive/Metastore upgrades, but they are a good chance to move the project forward. Spark 3.0 has a lot of upgrades in this area, and I think we made the right trade-offs, even though there are some API breaks.

Major versions are also a good time to ship major changes (e.g., in 2.0 we released whole-stage code generation).
 
I think the hardest part listed here is judging what the impact is. Whose call is that? It's hard to know how everyone is using things, and I think it's been harder to get feedback on SPIPs and API changes in general as people are busy with other things.

This is the hardest part, and we won't always get it right. I think that having the rubric though will help guide the conversation and help reviewers ask the right questions.

One other thing I'll add: sometimes users come to us, and we should listen! I was very surprised by the response to Karen's email on this list last week. An actual user was giving us feedback on the impact of the changes in Spark 3.0, and rather than listening, there was a lot of push-back. Users are never wrong when they are telling you what matters to them!
 
Like you mention, I think StackOverflow is unreliable; the posts could be many years old and no longer relevant.

While this is unfortunate, I think the more we can do to keep these answers relevant (either by updating them or by not breaking them), the better for the health of the Spark community.

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Matei Zaharia
In reply to this post by Michael Armbrust
+1 on this new rubric. It definitely captures the issues I’ve seen in Spark and in other projects. If we write down this rubric (or something like it), it will also be easier to refer to it during code reviews or in proposals of new APIs (we could ask “do you expect to have to change this API in the future, and if so, how”).

Matei


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Dongjoon Hyun-2
Hi, Matei and Michael.

I'm also a big supporter of policy-based project management.

Before going further,

    1. Could you estimate how many revert commits are required in `branch-3.0` for the new rubric?
    2. Are you going to revert all removed test cases for the deprecated ones?
    3. Will it cause any delay for the Apache Spark 3.0.0 release?
        (I believe it was previously scheduled for June, before Spark Summit 2020.)

Although there was a discussion already, I want to make sure about the following tough parts.

    4. We are not going to add the Scala 2.11 API, right?
    5. We are not going to support Python 2.x in Apache Spark 3.1+, right?
    6. Do we have enough resources for testing the deprecated ones?
        (Currently, we have 8 heavy Jenkins jobs for `branch-3.0` already.)

Especially for (2) and (6): we know that keeping deprecated APIs without tests doesn't give us any real support under the new rubric.

Bests,
Dongjoon.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Holden Karau


On Fri, Feb 28, 2020 at 9:48 AM Dongjoon Hyun <[hidden email]> wrote:
Hi, Matei and Michael.

I'm also a big supporter of policy-based project management.

Before going further,

    1. Could you estimate how many revert commits are required in `branch-3.0` for the new rubric?
    2. Are you going to revert all removed test cases for the deprecated ones?
This is a good point; making sure we keep the tests as well is important (worse than removing a deprecated API is shipping it broken).
    3. Will it cause any delay for the Apache Spark 3.0.0 release?
        (I believe it was previously scheduled for June, before Spark Summit 2020.)
I think it is OK if we need to delay to make a better release, especially given that our current preview releases are available to gather community feedback.

Although there was a discussion already, I want to make sure about the following tough parts.

    4. We are not going to add the Scala 2.11 API, right?
I hope not.
    5. We are not going to support Python 2.x in Apache Spark 3.1+, right?
I think doing that would be bad; it has already reached end of life elsewhere.
    6. Do we have enough resources for testing the deprecated ones?
        (Currently, we have 8 heavy Jenkins jobs for `branch-3.0` already.)

Especially for (2) and (6): we know that keeping deprecated APIs without tests doesn't give us any real support under the new rubric.

Bests,
Dongjoon.

On Thu, Feb 27, 2020 at 5:31 PM Matei Zaharia <[hidden email]> wrote:
+1 on this new rubric. It definitely captures the issues I’ve seen in Spark and in other projects. If we write down this rubric (or something like it), it will also be easier to refer to it during code reviews or in proposals of new APIs (we could ask “do you expect to have to change this API in the future, and if so, how”).

Matei

On Feb 24, 2020, at 3:02 PM, Michael Armbrust <[hidden email]> wrote:

Hello Everyone,

As more users have started upgrading to Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic versioning, so this major release is our chance to get it right [by breaking APIs]". Similarly, in many cases the response to questions about why an API was completely removed has been, "this API has been deprecated since x.x, so we have to remove it".

As a long time contributor to and user of Spark this interpretation of the policy is concerning to me. This reasoning misses the intention of the original policy, and I am worried that it will hurt the long-term success of the project.

I definitely understand that these are hard decisions, and I'm not proposing that we never remove anything from Spark. However, I would like to give some additional context and also propose a different rubric for thinking about API breakage moving forward.

Spark adopted semantic versioning back in 2014 during the preparations for the 1.0 release. As this was the first major release -- and as, up until fairly recently, Spark had only been an academic project -- no real promises had been made about API stability ever.

During the discussion, some committers suggested that this was an opportunity to clean up cruft and give the Spark APIs a once-over, making cosmetic changes to improve consistency. However, in the end, it was decided that in many cases it was not in the best interests of the Spark community to break things just because we could. Matei actually said it pretty forcefully:

I know that some names are suboptimal, but I absolutely detest breaking APIs, config names, etc. I’ve seen it happen way too often in other projects (even things we depend on that are officially post-1.0, like Akka or Protobuf or Hadoop), and it’s very painful. I think that we as fairly cutting-edge users are okay with libraries occasionally changing, but many others will consider it a show-stopper. Given this, I think that any cosmetic change now, even though it might improve clarity slightly, is not worth the tradeoff in terms of creating an update barrier for existing users.

In the end, while some changes were made, most APIs remained the same and users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this served the project very well, as compatibility means users are able to upgrade and we keep as many people on the latest versions of Spark (though maybe not the latest APIs of Spark) as possible.

As Spark grows, I think compatibility actually becomes more important and we should be more conservative rather than less. Today, there are very likely more Spark programs running than there were at any other time in the past. Spark is no longer a tool only used by advanced hackers, it is now also running "traditional enterprise workloads.'' In many cases these jobs are powering important processes long after the original author leaves.

Broken APIs can also affect libraries that extend Spark. This dependency can be even harder for users, as if the library has not been upgraded to use new APIs and they need that library, they are stuck.

Given all of this, I'd like to propose the following rubric as an addition to our semantic versioning policy. After discussion and if people agree this is a good idea, I'll call a vote of the PMC to ratify its inclusion in the official policy.

Considerations When Breaking APIs
The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.

Cost of Breaking an API
Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
  • Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate: 
    • How long has the API been in Spark?
    • Is the API common even for basic programs?
    • How often do we see recent questions in JIRA or mailing lists?
    • How often does it appear in StackOverflow or blogs?
  • Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
    • Will there be a compiler or linker error?
    • Will there be a runtime exception?
    • Will that exception happen after significant processing has been done?
    • Will we silently return different answers? (very hard to debug, might not even notice!)

Cost of Maintaining an API
Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
  • Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
  • User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.

Alternatives to Breaking an API
In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.

  • Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
  • Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated (see the sketch after this list).
  • Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
  • Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them to reduce the cost of eventually removing deprecated APIs.
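
To illustrate the deprecation-warning point above, here is a minimal Scala sketch (hypothetical method names, not an actual Spark API): the message names a concrete replacement and a timeline rather than just saying the method is deprecated.

    object Stats {
      /** Preferred entry point. */
      def approxQuantiles(values: Seq[Double], probabilities: Seq[Double]): Seq[Double] = {
        val sorted = values.sorted
        probabilities.map(p => sorted(((sorted.size - 1) * p).round.toInt))
      }

      /** The deprecation message says what to use instead and when removal may happen. */
      @deprecated("Use approxQuantiles(values, probabilities) instead; quantiles may be removed in a future major release.", "3.0.0")
      def quantiles(values: Seq[Double], probabilities: Seq[Double]): Seq[Double] =
        approxQuantiles(values, probabilities)
    }

Compiling a caller of Stats.quantiles then produces a warning that already tells the user exactly what to migrate to.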

Examples

Here are some examples of how I think the policy above could be applied to different issues that have been discussed recently. These are only to illustrate how to apply the above rubric, but are not intended to be part of the official policy.

[SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts #23311

  • Cost to Break - Multiple contexts in a single JVM never worked properly. When users tried it, they would nearly always report that Spark was broken (SPARK-2243), due to the confusing set of log messages. Given this, I think it is very unlikely that there are many real-world use cases active today. Even those cases likely suffer from undiagnosed issues, as there are many areas of Spark that assume a single context per JVM.
  • Cost to Maintain - We have recently had users ask on the mailing list if this was supported, as the conf led them to believe it was, and the existence of this configuration as "supported" makes it harder to reason about certain global state in SparkContext.

Decision: Remove this configuration and related code.
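
As a hedged illustration of the behavior this decision standardizes on (a local-mode sketch; the exact exception type and message vary by Spark version), a second SparkContext in the same JVM now fails fast instead of limping along with confusing logs:

    import org.apache.spark.{SparkConf, SparkContext}

    object MultipleContextsDemo {
      def main(args: Array[String]): Unit = {
        val conf  = new SparkConf().setAppName("ctx-demo").setMaster("local[*]")
        val first = new SparkContext(conf)
        try {
          new SparkContext(conf) // rejected up front instead of creating a half-working context
        } catch {
          case e: Exception => println(s"Second SparkContext rejected: ${e.getMessage}")
        } finally {
          first.stop()
        }
      }
    }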

[SPARK-25908] Remove registerTempTable #22921 (only looking at one API of this PR)

  • Cost to Break - This is a wildly popular API of Spark SQL that has been there since the first release. There are tons of blog posts and examples that use this syntax if you google "dataframe registerTempTable" (even more than the "correct" API, "dataframe createOrReplaceTempView"). All of these will be invalid for users of Spark 3.0.
  • Cost to Maintain - This is just an alias, so there is not a lot of extra machinery required to keep the API. Users have two ways to do the same thing, but we can note in the docs that this is just an alias (see the sketch below the decision).

Decision: Do not remove this API; I would even consider un-deprecating it. I anecdotally asked several users, and this is the API they prefer over the "correct" one.
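
For illustration, here is a small sketch (assuming a local SparkSession) of the two calls side by side; because the old name is just an alias, keeping it costs the project very little and lets years of existing examples keep working:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("alias-demo").master("local[*]").getOrCreate()
    val df = spark.range(10).toDF("id")

    df.registerTempTable("numbers")        // deprecated name, still what most blog posts show
    df.createOrReplaceTempView("numbers")  // current name; both register the same temp view

    spark.sql("SELECT COUNT(*) AS n FROM numbers").show()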

[SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195

  • Cost to Break - I think that this case actually exemplifies several anti-patterns in breaking APIs. In some languages, the deprecation warning gives you no help, other than what version the function was removed in. In R, it points users to a really deep conversation on the semantics of time in Spark SQL. None of the messages tell you how you should correctly be parsing a timestamp that is given to you in a format other than UTC. My guess is all users will blindly flip the flag to true (to keep using this function), so you've only succeeded in annoying them.
  • Cost to Maintain - These are two relatively isolated expressions, so there should be little cost to keeping them. Users can be confused by their semantics, so we probably should update the docs to point them to a best practice (I learned only by complaining on the PR that a good practice is to parse timestamps with the time zone included in the format expression, which naturally shifts them to UTC; see the sketch after the decision below).

Decision: Do not deprecate these two functions. We should update the docs to talk about best practices for parsing timestamps, including how to correctly shift them to UTC for storage.
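
As a hedged sketch of that best practice (it assumes a SparkSession named `spark` and its implicits; the pattern letters follow Spark's datetime patterns and may differ slightly across versions), including the zone offset in the format string yields a proper timestamp without any from_utc_timestamp/to_utc_timestamp gymnastics:

    import org.apache.spark.sql.functions.to_timestamp
    import spark.implicits._

    val raw = Seq("2020-03-06 10:15:00 -0800").toDF("ts_string")

    // The offset is part of the pattern, so the value is interpreted in its own zone,
    // stored as an instant, and displayed in the session time zone (UTC if so configured).
    val parsed = raw.select(to_timestamp($"ts_string", "yyyy-MM-dd HH:mm:ss Z").as("ts"))
    parsed.show(false)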

[SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902

  • Cost to Break - The TRIM function takes two string parameters. If we switch the parameter order, queries that use TRIM would silently return different results on different versions of Spark. Users may not notice this for a long time, and wrong query results can cause serious problems for them.
  • Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM function in the Scala API and in SQL will have different parameter orders.

Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate (with a warning, not by removing it) the SQL TRIM function and move users to the SQL-standard TRIM syntax.
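
A brief sketch of the recommended syntax (assuming a SparkSession named `spark`): spelling out BOTH ... FROM makes the role of each string explicit, so no parameter-order question can arise.

    spark.sql("SELECT TRIM(BOTH 'x' FROM 'xxSparkxx') AS trimmed").show()    // 'Spark'
    spark.sql("SELECT TRIM(LEADING 'x' FROM 'xxSparkxx') AS trimmed").show() // 'Sparkxx'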

Thanks for taking the time to read this! Happy to discuss the specifics and amend this policy as the community sees fit.

Michael




--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Sean Owen-2
On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <[hidden email]> wrote:
>>     1. Could you estimate how many revert commits are required in `branch-3.0` for new rubric?

Fair question about what actual change this implies for 3.0. So far it
seems like some targeted, quite reasonable reverts. I don't think
anyone's suggesting reverting loads of changes.


>>     2. Are you going to revert all removed test cases for the deprecated ones?
> This is a good point, making sure we keep the tests as well is important (worse than removing a deprecated API is shipping it broken).

(I'd say, yes of course! which seems consistent with what is happening now)


>>     3. Does it make any delay for Apache Spark 3.0.0 release?
>>         (I believe it was previously scheduled on June before Spark Summit 2020)
>
> I think if we need to delay to make a better release this is ok, especially given our current preview releases being available to gather community feedback.

Of course these things block 3.0 -- all the more reason to keep it
specific and targeted -- but nothing so far seems inconsistent with
finishing in a month or two.


>> Although there was a discussion already, I want to make the following tough parts sure.
>>     4. We are not going to add Scala 2.11 API, right?
> I hope not.
>>
>>     5. We are not going to support Python 2.x in Apache Spark 3.1+, right?
> I think doing that would be bad, it's already end of lifed elsewhere.

Yeah, this is an important subtext -- the valuable principles here
could be interpreted in many different ways depending on how much you
weigh the value of keeping APIs for compatibility against the value of
simplifying Spark and pushing users to newer APIs more forcibly. They're all
judgment calls, based on necessarily limited data about the universe
of users. We can only go on rare direct user feedback, on feedback
perhaps from vendors as proxies for a subset of users, and the general
good faith judgment of committers who have lived Spark for years.

My specific interpretation is that the standard is (correctly)
tightening going forward, and retroactively a bit for 3.0. But, I do
not think anyone is advocating for the logical extreme of, for
example, maintaining Scala 2.11 compatibility indefinitely. I think
that falls out readily from the rubric here: maintaining 2.11
compatibility is really quite painful if you ever support 2.13 too,
for example.



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Dongjoon Hyun-2
Hi, All.

There is an ongoing PR from Xiao referencing this email.


Bests,
Dongjoon.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Dongjoon Hyun-2
Hi, All.

Recently, reverting PRs seems to be spreading like the *well-known* virus.
Can we finalize this first before making unofficial personal decisions?
Technically, this thread was not a vote, and our website doesn't have a clear policy yet.

https://github.com/apache/spark/pull/27821
[SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
    ==> This technically reverts most of SPARK-25908.

https://github.com/apache/spark/pull/27835
Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the operands"

https://github.com/apache/spark/pull/27834
Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Bests,
Dongjoon.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Sean Owen-2
This thread established some good general principles, illustrated by a few good examples. It didn't draw specific conclusions about what to add back, which is why it wasn't at all controversial. What it means in specific cases is where there may be disagreement, and that harder question hasn't been addressed.

The reverts I have seen so far seemed like the obvious ones, but yes, there are several more going on now, some pretty broad. I am not even sure what all of them are. In addition to the PRs Dongjoon listed, there is https://github.com/apache/spark/pull/27839. Would it be too much overhead to post to this thread any changes that one believes are endorsed by these principles, and perhaps by a stricter interpretation of them now? It's important enough that we should get any data points or input, and get them now. (We're obviously not going to debate each one.) A draft PR, or several, actually sounds like a good vehicle for that -- as long as people know about them!

Also, is there any usage data available to share? Many arguments turn on "commonly used", but can we know that more concretely?

Otherwise I think we'll back into implementing personal interpretations of general principles, which is arguably the issue in the first place, even when everyone believes in good faith in the same principles.




Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Dongjoon Hyun-2
+1 for Sean's concerns and questions.

Bests,
Dongjoon.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Jungtaek Lim-2
+1 for Sean as well.

Moreover, as I mentioned in the previous thread, if we want to be strict about retaining public APIs, what we really need along with this is a similarly strict (or stricter) policy for adding public APIs. If we don't apply the policy symmetrically, the problem will only get worse: it is still not that hard to add a public API (it only requires a normal review), but once an API is added and released, it becomes really hard to remove it.

If we consider adding and deprecating/removing public APIs to be "critical" changes for the project, IMHO it would give better visibility and more open discussion if we routed them through the dev@ mailing list instead of directly filing a PR. With so many PRs being submitted, it is nearly impossible to look into all of them - it would require us to "watch" the repo and receive tons of mail. Compared to the volume of GitHub PRs, the dev@ mailing list is not that crowded, so there is less chance of missing critical changes, and they would not be decided quickly by only a couple of committers.

These suggestions would slow down development - which may make us realize we want to "classify/mark" user-facing public APIs separately from APIs that are merely exposed as public, and apply all of these policies only to the former. For the latter we don't need to guarantee anything.
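
As a sketch of what "classify/mark" could look like (the annotations here are hypothetical, not Spark's actual ones, though Spark already carries related markers such as @DeveloperApi and @Experimental): only types marked as user-facing would fall under the strict add/remove policy, while merely-public internals carry no promise.

    import scala.annotation.StaticAnnotation

    // Hypothetical markers: the strict versioning policy applies only to @UserFacing types;
    // @InternalApi types are public for technical reasons and make no compatibility promise.
    final class UserFacing extends StaticAnnotation
    final class InternalApi extends StaticAnnotation

    @UserFacing
    class ReaderOptions {
      def option(key: String, value: String): this.type = this
    }

    @InternalApi
    class PlannerInternals // free to change or remove in any release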



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Xiao Li
I want to publicly thank Ruifeng Zheng for his work listing all the signature differences in Core, SQL and Hive made in this upcoming release. For details, please read the files attached to SPARK-30982. I went over these files and submitted the following PRs to add back the SparkSQL APIs whose maintenance costs are low, based on my own experience in SparkSQL development:
If you think these APIs should not be added back, let me know and we can discuss the items further. In general, I think we should provide more evidence and discuss these APIs publicly before dropping them in the first place.

+1 on Jungtaek's comments. When we make API changes (e.g., adding new APIs or changing existing ones), we should regularly publish them on the dev list. I am willing to lead this effort, work with my colleagues to summarize all the merged commits [especially the API changes], and then send a bi-weekly digest to the dev list. If you are willing to join this working group and help build these digests, feel free to send me a note [[hidden email]].

Cheers,

Xiao

 



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Takeshi Yamamuro
Yea, +1 on Jungtaek's suggestion; having the same strict policy for adding new APIs looks nice.

> When we make API changes (e.g., adding new APIs or changing existing ones), we should regularly publish them on the dev list. I am willing to lead this effort, work with my colleagues to summarize all the merged commits [especially the API changes], and then send a bi-weekly digest to the dev list

This digest looks very helpful for the community, thanks, Xiao!

Bests,
Takeshi



--
---
Takeshi Yamamuro

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Dongjoon Hyun-2
Thank you all, and especially thanks for the audit efforts.

Until now, the whole community has been working together in the same direction under the existing policy, which is always a good thing.

Since it seems we are considering a new direction, I created an umbrella JIRA to track all related activities.

      https://issues.apache.org/jira/browse/SPARK-31085
      Amend Spark's Semantic Versioning Policy

As we know, a community-wide change of direction always has a huge impact on daily PR reviews and regular releases. So, we had better treat each reverting PR as a normal, independent PR instead of a follow-up. Specifically, I believe we need the following.

    1. Use new JIRA IDs instead of treating these as simple reverts or follow-ups.
        This is because we are not adding everything back blindly. For example,
            https://issues.apache.org/jira/browse/SPARK-31089
            "Add back ImageSchema.readImages in Spark 3.0"
        was created and closed as 'Won't Do' after weighing the trade-offs.
        We need a JIRA-issue-level history for this kind of request and decision.

    2. Sometimes, as described by Michael, reverting is insufficient.
        We need to provide more fine-grained deprecation for users' safety, case by case.
 
    3. Given the timeline, a newly added API should have test coverage in the same PR from the beginning.
        This is required because the whole reverting effort aims to give users a working API back.

I believe we have had a good discussion in this thread.
We are making a big change in Apache Spark's history.
Please be part of it by taking actions like replying, voting, and reviewing.

Thanks,
Dongjoon.

