Mesos checkpointing

Mesos checkpointing

Charles Allen
Per https://issues.apache.org/jira/browse/SPARK-4899, org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver allows checkpointing, but only org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler uses it. Is there a reason for that?
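
For context, the checkpoint setting in question is a per-framework flag on Mesos's FrameworkInfo. A minimal sketch of wiring it up via the Mesos Java API from Scala, assuming a Scheduler implementation and master URL are already in hand:

import org.apache.mesos.Protos.FrameworkInfo
import org.apache.mesos.{MesosSchedulerDriver, Scheduler}

// Build a driver whose framework has checkpointing enabled, so its
// executors can survive an agent restart and reconnect when it returns.
def buildCheckpointedDriver(scheduler: Scheduler, masterUrl: String): MesosSchedulerDriver = {
  val frameworkInfo = FrameworkInfo.newBuilder()
    .setUser("")          // empty string tells Mesos to use the current user
    .setName("Spark")
    .setCheckpoint(true)  // the flag createSchedulerDriver can pass through
    .build()
  new MesosSchedulerDriver(scheduler, frameworkInfo, masterUrl)
}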

Re: Mesos checkpointing

Timothy Chen
The only reason is that MesosClusterScheduler is by design long-running, so we really needed it to have failover configured correctly.

I wanted to file a JIRA ticket to allow users to configure it for each Spark framework, but never got around to it.

Per another question that came up on the mailing list, I believe we should add it, as it's a fairly straightforward effort.

Tim
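
A hypothetical sketch of what that per-framework knob could look like on the Spark side; the spark.mesos.* keys here are assumptions for illustration, not options Spark actually exposed at the time:

import org.apache.spark.SparkConf

// Hypothetical keys (neither existed in Spark at the time of this thread),
// read by the scheduler backend and handed to createSchedulerDriver.
val conf = new SparkConf()
val checkpoint = conf.getBoolean("spark.mesos.checkpoint", defaultValue = false)
val failoverTimeout = conf.getDouble("spark.mesos.failover.timeout", 0.0)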


Re: Mesos checkpointing

Charles Allen
We recently investigated internally why restarting the Mesos agents failed the Spark jobs (no real reason they should, right?) and came across this data. The other conversation, started by Yu, prompted me to poke at getting some of the tickets updated, to spread around the tribal knowledge floating in the community.

It sounds like the only thing keeping it from being enabled is a timeout config and someone volunteering to do some testing?



Re: Mesos checkpointing

Timothy Chen
Yes, adding the timeout config should be the only code change required.

And just to clarify, this is for reconnecting with the Mesos master (not the agents) after failover.

Tim
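
To illustrate the master-failover mechanics: the master keeps a disconnected framework's tasks alive for failover_timeout seconds, and a restarted scheduler that re-registers under its old FrameworkID within that window gets its tasks back. A sketch, assuming the ID from the first registration was persisted somewhere:

import org.apache.mesos.Protos.{FrameworkID, FrameworkInfo}
import org.apache.mesos.{MesosSchedulerDriver, Scheduler}

// Re-register after a scheduler restart: same FrameworkID plus a nonzero
// failover_timeout, so the master reattaches the framework's running tasks
// instead of killing them.
def rebuildDriver(scheduler: Scheduler, masterUrl: String,
                  savedFrameworkId: String): MesosSchedulerDriver = {
  val frameworkInfo = FrameworkInfo.newBuilder()
    .setUser("")
    .setName("Spark")
    .setId(FrameworkID.newBuilder().setValue(savedFrameworkId))
    .setFailoverTimeout(3600.0) // seconds the master waits before giving up on us
    .setCheckpoint(true)
    .build()
  new MesosSchedulerDriver(scheduler, frameworkInfo, masterUrl)
}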


Re: Mesos checkpointing

Michael Gummelt
> We recently investigated internally why restarting the Mesos agents failed the Spark jobs (no real reason they should, right?) and came across this data.

Restarting the agent without checkpointing enabled will kill the executor, but that still shouldn't cause the Spark job to fail, since Spark jobs should tolerate executor failures.
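
For reference, executor loss only turns into a job failure indirectly: a job aborts once a single task has failed spark.task.maxFailures times (default 4). Raising that bound is one way to ride out a wave of agent restarts:

import org.apache.spark.SparkConf

// A job aborts only after one task fails spark.task.maxFailures times;
// a higher value tolerates more executor churn during rolling restarts.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")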


Re: Mesos checkpointing

Charles Allen
The issue on our side is that we tend to roll out a bunch of agent updates at about the same time, so rolling an agent, waiting for the Spark jobs to recover, then rolling another agent is not at all practical. It would be a huge benefit if we could just update the agents in bulk (or even sequentially, waiting only for the Mesos agent to recover).


Re: Mesos checkpointing

Michael Gummelt
Ah, then yeah, checkpointing should solve your problem. Let's do that.
