planning & discussion for larger scheduler changes


planning & discussion for larger scheduler changes

Imran Rashid-3
Kay and I were discussing some of the bigger scheduler changes getting proposed lately, and realized there is a broader discussion to have with the community, outside of any single JIRA.  I'll start by sharing my initial thoughts; I know Kay has thoughts on this too, but it would be good to get input from everyone.

In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are proposed changes in behavior that are not fixes for *correctness* in fault tolerance, but improvements to performance when there are faults.  The changes make some intuitive sense, but it's also hard to judge whether they are necessarily better; it's hard to verify the correctness of the changes; and it's hard to even know that we haven't broken the old behavior (because of how brittle the scheduler seems to be).

So I'm wondering:

1) in the short-term, can we find ways to get these changes merged, but turned off by default, in a way that we feel confident won't break existing code?

2) a bit longer-term -- should we be considering bigger rewrites to the scheduler?  Particularly, to improve testability?  E.g., maybe if it was rewritten to more completely follow the actor model and eliminate shared state, the code would be cleaner and more testable.  Or maybe this is a crazy idea, and we'd just lose everything we'd learned so far and be stuck fixing just as many bugs in the new version.
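For concreteness, a minimal sketch of what an actor-style scheduler core could look like: a single event-loop thread owns all mutable scheduler state, and other components interact with it only by posting messages. All names below are illustrative, not Spark's actual internals.

import java.util.concurrent.LinkedBlockingQueue

// Hypothetical events; real scheduler events would carry much more information.
sealed trait SchedulerEvent
case class TaskFinished(taskId: Long) extends SchedulerEvent
case class ExecutorLost(execId: String) extends SchedulerEvent

class SchedulerLoop {
  private val events = new LinkedBlockingQueue[SchedulerEvent]()

  // All mutable state is confined to the loop thread; no locks, no shared state.
  private var runningTasks = Set.empty[Long]

  private val loopThread = new Thread("scheduler-event-loop") {
    override def run(): Unit = {
      try {
        while (true) handle(events.take())  // take() blocks until an event arrives
      } catch {
        case _: InterruptedException => // stop() interrupts the thread to shut it down
      }
    }
  }

  def start(): Unit = loopThread.start()
  def stop(): Unit = loopThread.interrupt()

  // Other components never touch scheduler state directly; they only post events.
  def post(event: SchedulerEvent): Unit = events.put(event)

  private def handle(event: SchedulerEvent): Unit = event match {
    case TaskFinished(taskId) => runningTasks -= taskId
    case ExecutorLost(execId) => // e.g., resubmit tasks that were running on execId
  }
}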

Imran

Re: planning & discussion for larger scheduler changes

rxin


On Fri, Mar 24, 2017 at 4:41 PM, Imran Rashid <[hidden email]> wrote:

1) in the short-term, can we find ways to get these changes merged, but turned off by default, in a way that we feel confident won't break existing code?


+1

For risky features, that's how we often do it: feature-flag it and turn it on later.
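As a rough sketch of that pattern (the config key and surrounding method below are hypothetical, not an existing Spark setting), the new code path is guarded by a boolean conf that defaults to the current behavior:

import org.apache.spark.SparkConf

object FeatureFlagSketch {
  def handleFetchFailure(conf: SparkConf): Unit = {
    // Hypothetical flag; off by default so existing behavior is untouched.
    val eagerReschedule =
      conf.getBoolean("spark.scheduler.experimental.eagerReschedule.enabled", false)

    if (eagerReschedule) {
      // new, riskier behavior (opt-in until validated in the wild)
    } else {
      // existing, well-tested behavior
    }
  }
}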

 

2) a bit longer-term -- should we be considering bigger rewrites to the scheduler?  Particularly, to improve testability?  E.g., maybe if it was rewritten to more completely follow the actor model and eliminate shared state, the code would be cleaner and more testable.  Or maybe this is a crazy idea, and we'd just lose everything we'd learned so far and be stuck fixing just as many bugs in the new version.


This of course depends. Refactoring a large complicated piece of code is one of the most challenging tasks in engineering. It is extremely difficult to ensure things are correct even after that, especially in areas that don't have amazing test coverage.



Re: planning & discussion for larger scheduler changes

Tom Graves-2
In reply to this post by Imran Rashid-3
1) I think this depends on the individual JIRA, case by case.  I haven't looked at SPARK-14649 in detail; it seems much larger, although more the way I think we want to go, while SPARK-13669 seems less risky and easily configurable.

2) I don't know whether it needs an entire rewrite, but I think some major changes need to be made, especially in the handling of reduces and fetch failures.  We could do a much better job of not throwing away existing work and handling the failure cases more optimally.  For this, would it make sense for us to start with a JIRA that has a bullet list of things we would like to improve, so we can get a more cohesive view and see how invasive it would really be?

Tom

Re: planning & discussion for larger scheduler changes

Kay Ousterhout
(1) I'm pretty hesitant to merge these larger changes, even if they're feature flagged, because:
   (a) For some of these changes, it's not obvious that they'll always improve performance. e.g., for SPARK-14649, it's possible that the tasks that got re-started (and temporarily are running in two places) are going to fail in the first attempt (because they haven't read the missing map output yet).  In that case, not re-starting them will lead to worse performance.
   (b) The scheduler already has some secret flags that aren't documented and are used by only a few people.  I'd like to avoid adding more of these (e.g., by merging these features, but having them off by default), because very few users use them (since it's hard to learn about them), they add complexity to the scheduler that we have to maintain, and for users who are considering using them, they often hide advanced behavior that's hard to reason about anyway (e.g., the point above for SPARK-14649). 
   (c) The worst performance problem is when jobs just hang or crash; we've seen a few cases of that in recent bugs, and I'm worried that merging these complex performance improvements trades better performance in a small number of cases for the possibility of worse performance via job crashes/hangs in other cases.

Roughly I think our standards for merging performance fixes to the scheduler should be that the performance improvement either (a) is simple / easy to reason about or (b) unambiguously fixes a serious performance problem.  In the case of SPARK-14649, for example, it is complex, and improves performance in some cases but hurts it in others, so doesn't fit either (a) or (b).

(2) I do think there are some scheduler refactorings that would improve testability and our ability to reason about correctness, but I think there are somewhat surgical, smaller things we could do in the vein of Imran's comment about reducing shared state.  Right now we have these super wide interfaces between different components of the scheduler, and it means you have to reason about the TSM, TSI, CGSB, and DAGSched to figure out whether something works.  I think we could have an effort to make each component have a much narrower interface, so that each part hides a bunch of complexity from other components.  The most obvious place to do this in the short term is to remove a bunch of info tracking from the DAGScheduler; I filed a JIRA for that here.  I suspect there are similar things that could be done in other parts of the scheduler.
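As an illustration of the narrower-interface idea (the class and method names below are made up, not Spark's actual ones), a component could expose only a small read-only trait to its collaborators and keep its bookkeeping private:

// Sketch only: the DAGScheduler-side code would depend on this narrow trait
// instead of reaching into the task set manager's internal fields.
trait TaskSetStatus {
  def isZombie: Boolean
  def pendingTaskCount: Int
}

class TaskSetManagerSketch extends TaskSetStatus {
  // Mutable bookkeeping stays private to this component.
  private var pending = Vector.empty[Long]
  private var zombie = false

  override def isZombie: Boolean = zombie
  override def pendingTaskCount: Int = pending.size

  // State changes only through this component's own methods.
  def taskFinished(taskId: Long): Unit = { pending = pending.filterNot(_ == taskId) }
  def abort(): Unit = { zombie = true }
}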

Tom's comments re: (2) are more about performance improvements than readability / testability / debuggability, but they also seem important, and it does seem useful to have a JIRA tracking those.

-Kay


Re: planning & discussion for larger scheduler changes

Imran Rashid-3
Thanks for the responses, all.  I may have worded my original email poorly -- I don't want to focus too much on SPARK-14649 and SPARK-13669 in particular, but more on how we should be approaching these changes.

On Mon, Mar 27, 2017 at 9:01 PM, Kay Ousterhout <[hidden email]> wrote:
(1) I'm pretty hesitant to merge these larger changes, even if they're feature flagged, because:
   (a) For some of these changes, it's not obvious that they'll always improve performance. e.g., for SPARK-14649, it's possible that the tasks that got re-started (and temporarily are running in two places) are going to fail in the first attempt (because they haven't read the missing map output yet).  In that case, not re-starting them will lead to worse performance.
   (b) The scheduler already has some secret flags that aren't documented and are used by only a few people.  I'd like to avoid adding more of these (e.g., by merging these features, but having them off by default), because very few users use them (since it's hard to learn about them), they add complexity to the scheduler that we have to maintain, and for users who are considering using them, they often hide advanced behavior that's hard to reason about anyway (e.g., the point above for SPARK-14649). 

We definitely need to evaluate each change on a case-by-case basis, and decide whether there really is a potential benefit worth the complexity.

But I'm actually wondering whether we need to be much more open on point (b).  The model Spark currently uses has been sufficient so far, but we're increasingly seeing Spark used on bigger clusters, with more varied deployment types, and I'm not convinced we really know what the right answer is.  It's really hard to find behavior that will be optimal for everyone -- it's not just a matter of looking at complexity or doing microbenchmarks.  Lots of configurations are a pain, as you say -- complexity we have to manage in the code, and even the users have to be pretty advanced to understand how to use them.  But I'm just not confident that we know the right behavior for setups that range from small clusters; to large clusters with heterogeneous hardware, ongoing maintenance, and various levels of in-house ops support; to spot instances in the cloud.

For example, let's say users start pushing Spark onto more spot instances, see 75% of executors die, but still want Spark to make reasonable progress.  I can imagine wanting some large changes in behavior to support that.  I don't think we're going to have enough info on the exact characteristics of spot instances and what the best behavior is -- there will probably be some point where we turn users loose with some knobs, and hopefully, after some experience in the wild, we can come up with recommended settings.  Furthermore, given that our tests alone don't lead to a ton of confidence, it feels like we *have* to turn some of these things over to users to bang on for a while, with the old behavior still available to most users.

(I agree the benefit of SPARK-14649 in particular is unclear, and I'm not advocating one way or the other on that particular change here, just trying to set up the right frame of mind for considering it.)
 
   (c) The worst performance problem is when jobs just hang or crash; we've seen a few cases of that in recent bugs, and I'm worried that merging these complex performance improvements trades better performance in a small number of cases for the possibility of worse performance via job crashes/hangs in other cases.

I completely agree with this.  Obviously I'm contradicting my response to (b) above, which is why I'm torn.  I am not sure of a way to get both simultaneously, given the current state of the code.  Even under a feature flag, the changes I'm thinking of are invasive enough that they introduce a risk of bugs even for the old behavior.
 
Roughly I think our standards for merging performance fixes to the scheduler should be that the performance improvement either (a) is simple / easy to reason about or (b) unambiguously fixes a serious performance problem.  In the case of SPARK-14649, for example, it is complex, and improves performance in some cases but hurts it in others, so doesn't fit either (a) or (b).

In the past, I would have totally agreed with you; in fact, I think I've been one to advocate caution with changes.  But I'm slowly coming to think that we need to be less strict about "unambiguously fixes a serious performance problem".  I would phrase it more like "clear demonstration of a use case with a significant performance problem".  I know I'm now quibbling over vague adjectives, but hopefully that conveys my sentiment.

(2) I do think there are some scheduler refactorings that would improve testability and our ability to reason about correctness, but I think there are somewhat surgical, smaller things we could do in the vein of Imran's comment about reducing shared state.  Right now we have these super wide interfaces between different components of the scheduler, and it means you have to reason about the TSM, TSI, CGSB, and DAGSched to figure out whether something works.

don't forget the OutputCommitCoordinator :)
 
  I think we could have an effort to make each component have a much narrower interface, so that each part hides a bunch of complexity from other components.  The most obvious place to do this in the short term is to remove a bunch of info tracking from the DAGScheduler; I filed a JIRA for that here.  I suspect there are similar things that could be done in other parts of the scheduler.

Yeah, this seems like a good idea.  Hopefully it would also improve testability.
 
On Mon, Mar 27, 2017 at 11:06 AM, Tom Graves <[hidden email]> wrote:
I don't know whether it needs an entire rewrite, but I think some major changes need to be made, especially in the handling of reduces and fetch failures.  We could do a much better job of not throwing away existing work and handling the failure cases more optimally.  For this, would it make sense for us to start with a JIRA that has a bullet list of things we would like to improve, so we can get a more cohesive view and see how invasive it would really be?

I agree that sounds like a good idea.  I think Sital is going to take a shot at a design doc coming out of the discussion for changes related to SPARK-14649.  But it would be great to get multiple folks thinking about that list, as everyone will have different use cases in mind.

Thanks, everyone, for the input.

Imran
  

Re: planning & discussion for larger scheduler changes

Tom Graves-2
In reply to this post by Imran Rashid-3
If we are worried about major changes destabilizing current code (which I can understand), the only way around that is to make them pluggable or configurable.  For major changes, making them pluggable seems cleaner from a code-clutter point of view, but it also means you may have to make the same or similar change in two places.
We could make the interfaces better defined, but if the major changes would require interface changes, that doesn't help.  It still seems like, if we had a list of the things we would like to accomplish and an idea of the rough overall design, we could see whether defining the interfaces better or making them pluggable would help.
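To make the pluggable option concrete, one possible shape (the class and config names below are hypothetical, not an existing Spark API) is a small strategy trait whose implementation is chosen by configuration, so the default and an experimental implementation can live side by side:

import org.apache.spark.SparkConf

// Narrow strategy interface for the behavior we want to be able to swap out.
trait FetchFailureHandler {
  def onFetchFailure(stageId: Int, mapIndex: Int): Unit
}

// Current behavior, kept as the default implementation.
class DefaultFetchFailureHandler extends FetchFailureHandler {
  override def onFetchFailure(stageId: Int, mapIndex: Int): Unit = {
    // existing logic: mark the map output as missing, resubmit the stage, etc.
  }
}

object FetchFailureHandler {
  // Hypothetical config key; defaults to the existing implementation.
  def create(conf: SparkConf): FetchFailureHandler = {
    val className = conf.get(
      "spark.scheduler.fetchFailureHandler.class",
      classOf[DefaultFetchFailureHandler].getName)
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[FetchFailureHandler]
  }
}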

There seem to be three JIRAs all related to handling fetch failures: https://issues.apache.org/jira/browse/SPARK-20091, SPARK-14649, and SPARK-19753.  It might be nice to create one epic JIRA where we think about a design as a whole and discuss it more.  Any objections to this?  If not, I'll create an epic and link the others to it.


Tom
