[DISCUSS] Handling correctness/data loss jiras

Tom Graves-2
Hello all,

I've noticed some inconsistencies in the way we are handling data loss/correctness issues. I think we need to take these very seriously, as they could be costing businesses real money and impacting real decisions and business logic. I would like to discuss how we can make sure these are handled consistently and with urgency going forward.

A few things I would like to propose are below. Most of these are up to the developers and committers to carry out, so I want to know what everyone thinks and whether people have other ideas.

- Label any correctness/data loss jira with "correctness"
- Mark the jira as a blocker by default if someone suspects a corruption/loss issue
- Make sure the description is clear about when it occurs and the impact to the user
- Ensure it's backported to all active branches
- See if we can have a separate section in the release notes for these


Thanks,
Tom Graves

Re: [DISCUSS] Handling correctness/data loss jiras

Sean Owen-2
I doubt the question is whether people want to take such issues seriously -- all else equal, of course everyone does. 

A JIRA label plus a place in the release notes sounds like a good concrete step that isn't happening consistently now. That's a clear flag that at least one person believes issue X is a blocker.

Is this about specific JIRAs? I think it's more useful to illustrate in the context of specific issues. For example I haven't been following JIRAs well, and don't know what is being contested here.

I share frustration that Somebody should be working on Important Things, but don't think the difference between getting those done and not done is reminding people that Important Things need doing. What's the cause that leads to concrete corrective action?

Do we need more committers? Fewer new features? More conservative releases? Less work on X to work on this?

Lastly, you raise an important question as an aside, one we haven't answered: when does a branch go inactive? I am sure 2.0.x is inactive, de facto, along with all 1.x. I think 2.1.x is inactive too. Should we put any rough guidance in place, e.g. a branch is maintained for 12-18 months?




On Mon, Aug 13, 2018 at 8:45 AM Tom Graves <[hidden email]> wrote:
Hello all,

I've noticed some inconsistencies in the way we are handling data loss/correctness issues. I think we need to take these very seriously, as they could be costing businesses real money and impacting real decisions and business logic. I would like to discuss how we can make sure these are handled consistently and with urgency going forward.

A few things I would like to propose are below. Most of these are up to the developers and committers to carry out, so I want to know what everyone thinks and whether people have other ideas.

- Label any correctness/data loss jira with "correctness"
- Mark the jira as a blocker by default if someone suspects a corruption/loss issue
- Make sure the description is clear about when it occurs and the impact to the user
- Ensure it's backported to all active branches
- See if we can have a separate section in the release notes for these


Thanks,
Tom Graves

Re: [DISCUSS] Handling correctness/data loss jiras

Tom Graves-2

Not a specific jira, but I was looking at all the recent jiras with the "correctness" label, and things are definitely being handled inconsistently in my opinion (https://issues.apache.org/jira/issues/?jql=labels+%3D+correctness). The inconsistencies are in the things I've mentioned above: priority is not set high enough, the description is not clear, some are backported to 2.2 and some are not. Obviously there could be ones without the "correctness" label as well, since until recently I was not aware that this label should be applied to this type of issue.
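
For anyone who wants to run the same search for a different label, here is a minimal sketch of how the query string in the link above can be rebuilt. The helper name `label_search_url` is hypothetical; the only things taken from the thread are the base URL and the JQL clause `labels = correctness`.

```python
from urllib.parse import urlencode

# Base search page, taken from the link quoted in the message above.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def label_search_url(label):
    """Build a Jira search URL for issues carrying the given label.

    Hypothetical helper for illustration; it only assembles a URL and
    does not contact Jira.
    """
    # urlencode turns spaces into "+" and "=" into "%3D".
    return JIRA_SEARCH + "?" + urlencode({"jql": f"labels = {label}"})


url = label_search_url("correctness")
print(url)
```

The printed URL should match the one in the message, so the same pattern can be reused for other labels.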

We have no real guidelines in this area for developers and committers to follow so I think defining some would help everyone. 

I realize everyone's time is important and everyone has different priorities, but I think this sort of issue is one we as a community should take care of above everything else. If I'm a business using Apache Spark for business-critical things and I find that there are data loss or corruption issues consistently in the releases and it's not our highest priority to fix them, I'm going to be very hesitant to use and stay with Spark.

One specific example of priority is in the 2.4 code freeze/release thread, where it was suggested that we release without a fix for SPARK-23243. Granted, we have done a bunch of releases without it, but until recently it wasn't marked as a blocker either. I'll admit that I missed this jira when it was filed and only recently became aware of it. I changed the priority on it.

|  I share frustration that Somebody should be working on Important Things, but don't think the difference between getting those done and not done is reminding people that Important Things need doing. What's the cause that leads to concrete corrective action?

I'm not really sure what you mean by this. This proposal is to introduce a process for this type of issue so it's at least brought to people's attention. We can't do anything to make people work on certain things. If they aren't raised as important issues then it's really easy to miss these things. If it's a blocker we should also not be doing any new releases without a fix for it, which may motivate people to look at it.

I agree it would be good for us to make it more official which branches are being maintained. I think at this point it's still 2.1.x, 2.2.x, and 2.3.x, since we recently did releases of all of these. Since 2.4 will be coming out we should definitely think about no longer maintaining 2.1.x. Perhaps we need a table on our release page about this. But this should be a separate thread.


Tom

On Monday, August 13, 2018, 9:03:42 AM CDT, Sean Owen <[hidden email]> wrote:


I doubt the question is whether people want to take such issues seriously -- all else equal, of course everyone does. 

A JIRA label plus a place in the release notes sounds like a good concrete step that isn't happening consistently now. That's a clear flag that at least one person believes issue X is a blocker.

Is this about specific JIRAs? I think it's more useful to illustrate in the context of specific issues. For example I haven't been following JIRAs well, and don't know what is being contested here.

I share frustration that Somebody should be working on Important Things, but don't think the difference between getting those done and not done is reminding people that Important Things need doing. What's the cause that leads to concrete corrective action?

Do we need more committers? Fewer new features? More conservative releases? Less work on X to work on this?

Lastly, you raise an important question as an aside, one we haven't answered: when does a branch go inactive? I am sure 2.0.x is inactive, de facto, along with all 1.x. I think 2.1.x is inactive too. Should we put any rough guidance in place, e.g. a branch is maintained for 12-18 months?





Re: [DISCUSS] Handling correctness/data loss jiras

Imran Rashid-4
In reply to this post by Sean Owen-2
I don't think we've been great about backporting correctness issues.  This is one example which comes to mind (not to point fingers, just the one I know of immediately):


We also let another related issue slide for quite a while:


It turns out that issue was actually extremely tricky, so it's not that it got totally ignored -- but on the other hand, maybe it didn't really get quite the attention it deserved.

Using the jira label is good -- should we also have a message to dev, with [CORRECTNESS-BUG] or something in the subject line when an issue like this is discovered, to raise the profile? (I realize you could get the same info from jira, but with so much jira volume, maybe it's worth an extra mode.)

|  Do we need more committers? Fewer new features? More conservative releases? Less work on X to work on this?

This is a good question. More committers would be good to take on the volume of things to be done, but the flip side is that we need to make sure committers can do thorough reviews in the first place to prevent these types of bugs, and respond to them when they show up. But it's also worth noting this bug was *not* introduced by somebody inexperienced -- it's been around for a long time, and lots of experienced devs have managed to overlook it:


Really, I'd like to see committers (and folks that want to be committers) realize they sometimes have to set aside their feature work to take on these more important things, even if it means their features slip a release. In practice, this means fewer new features. So I do think we should be working to bring on more committers, but we have to make sure we're not losing sight of these things.


Re: [DISCUSS] Handling correctness/data loss jiras

Sean Owen-3
In reply to this post by Tom Graves-2
Generally: if someone thinks correctness fix X should be backported further, I'd say just do it, if it's to an active release branch (see below). Anything that important has to outweigh most any other concern, like behavior changes.


On Mon, Aug 13, 2018 at 11:08 AM Tom Graves <[hidden email]> wrote:
|  I'm not really sure what you mean by this. This proposal is to introduce a process for this type of issue so it's at least brought to people's attention. We can't do anything to make people work on certain things. If they aren't raised as important issues then it's really easy to miss these things. If it's a blocker we should also not be doing any new releases without a fix for it, which may motivate people to look at it.

I mean, what are concrete steps beyond saying this is a problem? That's the important thing to discuss.

There's a good one here: let's say anything that's likely to be a correctness or data loss issue should automatically be labeled 'correctness' and set to Blocker.

That can go into the how-to-contribute manual in the docs and in a note to dev@.
 
 
|  I agree it would be good for us to make it more official which branches are being maintained. I think at this point it's still 2.1.x, 2.2.x, and 2.3.x, since we recently did releases of all of these. Since 2.4 will be coming out we should definitely think about no longer maintaining 2.1.x. Perhaps we need a table on our release page about this. But this should be a separate thread.


I propose writing something like this in the 'versioning' doc page, to at least establish a policy:

Minor release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.1.x is no longer considered maintained as of July 2018, 18 months after the release of 2.1.0 in December 2016.

This gives us -- and more importantly users -- some understanding of what to expect for backporting and fixes.
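
As a rough illustration of the arithmetic behind that wording, here is a small sketch that computes when an 18-month window closes. The function names are hypothetical, and treating the final month as still maintained is an assumption made so that the result lines up with "no longer considered maintained as of July 2018"; the policy text itself does not spell out the boundary.

```python
from datetime import date

# Window length from the proposed policy text.
MAINTENANCE_MONTHS = 18


def maintenance_end(release_year, release_month):
    """Return the (year, month) in which the maintenance window closes."""
    # Count months from year 0 so the year rollover is handled uniformly.
    total = release_year * 12 + (release_month - 1) + MAINTENANCE_MONTHS
    return total // 12, total % 12 + 1


def is_maintained(release_year, release_month, today):
    """True while `today` falls in or before the closing month (assumption)."""
    return (today.year, today.month) <= maintenance_end(release_year, release_month)


# Branch 2.1.x: 2.1.0 was released in December 2016.
end_2_1 = maintenance_end(2016, 12)
print(end_2_1)
print(is_maintained(2016, 12, date(2018, 7, 1)))
```

Under these assumptions the window for 2.1.x closes in June 2018, so the branch is treated as unmaintained from July 2018 onward.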


I am going to revive the thread about adding PMC / committers as it's overdue. That may not do much, but more hands to do more work may free up people to focus on deeper, harder issues.

Re: [DISCUSS] Handling correctness/data loss jiras

Tom Graves-2

|  I mean, what are concrete steps beyond saying this is a problem? That's the important thing to discuss.

Sorry, I'm a bit confused by your statement, but I also think I agree. I started this thread for this reason. I pointed out that I thought it was a problem and also brought up things I thought we could do to help fix it.

Maybe I wasn't clear in the first email: the list of things I had were proposals for what we do with a jira that is for a correctness/data loss issue. It's the committers and developers that are involved in this, though, so if people don't agree or aren't going to do them, then it doesn't work.

Just to restate what I think we should do:

- Label any correctness/data loss jira with "correctness"
- Mark the jira as a blocker by default if someone suspects a corruption/loss issue
- Make sure the description is clear about when it occurs and the impact to the user
- Ensure it's backported to all active branches
- See if we can have a separate section in the release notes for these

The last one, I guess, is more of a one-time thing that I can file a jira for. The first 4 would be done for each jira filed.
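
To make the per-jira part of the list concrete, here is a minimal sketch of how the first four items could be checked against an issue record. Everything here is hypothetical: the `Issue` class, its fields, and `checklist_gaps` are illustrative names and do not correspond to any Jira API. Whether a description is actually clear cannot be checked mechanically, so the sketch only flags an empty one.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical, simplified view of a jira for illustration only."""
    key: str
    labels: set = field(default_factory=set)
    priority: str = "Major"
    description: str = ""
    fix_branches: set = field(default_factory=set)


def checklist_gaps(issue, active_branches):
    """Return the proposed checklist items this issue does not yet satisfy."""
    gaps = []
    if "correctness" not in issue.labels:
        gaps.append('missing "correctness" label')
    if issue.priority != "Blocker":
        gaps.append("priority is not Blocker")
    if not issue.description.strip():
        gaps.append("description is empty")
    # Branches that are active but have no backported fix yet.
    missing = sorted(set(active_branches) - issue.fix_branches)
    if missing:
        gaps.append("not backported to: " + ", ".join(missing))
    return gaps


example = Issue(
    key="EXAMPLE-1",  # placeholder key, not a real jira
    labels={"correctness"},
    priority="Major",
    description="Wrong results when ...",
    fix_branches={"2.3"},
)
gaps = checklist_gaps(example, ["2.2", "2.3"])
print(gaps)
```

The list of active branches is passed in rather than hard-coded, since which branches count as active is still an open question in this thread.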

I'm proposing we do these things and as such if people agree we would also document those things in the committers or developers guide and send email to the list. 

 

Tom
On Monday, August 13, 2018, 11:17:22 AM CDT, Sean Owen <[hidden email]> wrote:


Generally: if someone thinks correctness fix X should be backported further, I'd say just do it, if it's to an active release branch (see below). Anything that important has to outweigh most any other concern, like behavior changes.


On Mon, Aug 13, 2018 at 11:08 AM Tom Graves <[hidden email]> wrote:
|  I'm not really sure what you mean by this. This proposal is to introduce a process for this type of issue so it's at least brought to people's attention. We can't do anything to make people work on certain things. If they aren't raised as important issues then it's really easy to miss these things. If it's a blocker we should also not be doing any new releases without a fix for it, which may motivate people to look at it.

I mean, what are concrete steps beyond saying this is a problem? That's the important thing to discuss.

There's a good one here: let's say anything that's likely to be a correctness or data loss issue should automatically be labeled 'correctness' and set to Blocker.

That can go into the how-to-contribute manual in the docs and in a note to dev@.
 
 
|  I agree it would be good for us to make it more official which branches are being maintained. I think at this point it's still 2.1.x, 2.2.x, and 2.3.x, since we recently did releases of all of these. Since 2.4 will be coming out we should definitely think about no longer maintaining 2.1.x. Perhaps we need a table on our release page about this. But this should be a separate thread.


I propose writing something like this in the 'versioning' doc page, to at least establish a policy:

Minor release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.1.x is no longer considered maintained as of July 2018, 18 months after the release of 2.1.0 in December 2016.

This gives us -- and more importantly users -- some understanding of what to expect for backporting and fixes.


I am going to revive the thread about adding PMC / committers as it's overdue. That may not do much, but more hands to do more work may free up people to focus on deeper, harder issues.

Re: [DISCUSS] Handling correctness/data loss jiras

Imran Rashid-4
+1 on what we should do.

On Mon, Aug 13, 2018 at 3:06 PM, Tom Graves <[hidden email]> wrote:

|  I mean, what are concrete steps beyond saying this is a problem? That's the important thing to discuss.

Sorry, I'm a bit confused by your statement, but I also think I agree. I started this thread for this reason. I pointed out that I thought it was a problem and also brought up things I thought we could do to help fix it.

Maybe I wasn't clear in the first email: the list of things I had were proposals for what we do with a jira that is for a correctness/data loss issue. It's the committers and developers that are involved in this, though, so if people don't agree or aren't going to do them, then it doesn't work.

Just to restate what I think we should do:

- Label any correctness/data loss jira with "correctness"
- Mark the jira as a blocker by default if someone suspects a corruption/loss issue
- Make sure the description is clear about when it occurs and the impact to the user
- Ensure it's backported to all active branches
- See if we can have a separate section in the release notes for these

The last one, I guess, is more of a one-time thing that I can file a jira for. The first 4 would be done for each jira filed.

I'm proposing we do these things and as such if people agree we would also document those things in the committers or developers guide and send email to the list. 

 

Tom

Re: [DISCUSS] Handling correctness/data loss jiras

Tom Graves-2
Since we haven't heard any objections to this, the documentation has been updated (thanks to Sean).

All devs, please make sure to re-read: http://spark.apache.org/contributing.html .

Note that the set of labels used in Jira has been documented, and correctness or data loss issues should be marked as Blocker by default. There is also a label for marking a jira as having something that needs to go into the release-notes.


Tom

On Tuesday, August 14, 2018, 3:32:27 PM CDT, Imran Rashid <[hidden email]> wrote:


+1 on what we should do.
