[build system] jenkins got itself wedged...

shane knapp
...so i kicked it and it's now back up and happily building.

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: [build system] jenkins got itself wedged...

Herman van Hövell tot Westerflier-2
Thanks Shane!

On Tue, May 16, 2017 at 5:18 PM, shane knapp <[hidden email]> wrote:
> ...so i kicked it and it's now back up and happily building.

--
Herman van Hövell
Software Engineer
Databricks Inc.
[hidden email]
+31 6 420 590 27
http://databricks.com


Re: [build system] jenkins got itself wedged...

shane knapp
...but just now i started getting alerts on system load, which was
rather high.  i had to kick jenkins again, and will keep an eye on the
master in case it needs a reboot.

sorry about the interruption of service...

shane

On Tue, May 16, 2017 at 8:18 AM, shane knapp <[hidden email]> wrote:
> ...so i kicked it and it's now back up and happily building.

Re: [build system] jenkins got itself wedged...

shane knapp
i'm going to need to perform a quick reboot on the jenkins master.  it
looks like it's hung again.

sorry about this!

shane

On Tue, May 16, 2017 at 12:55 PM, shane knapp <[hidden email]> wrote:

> ...but just now i started getting alerts on system load, which was
> rather high.  i had to kick jenkins again, and will keep an eye on the
> master and possible need to reboot.
>
> sorry about the interruption of service...
>
> shane

Re: [build system] jenkins got itself wedged...

shane knapp
ok, we're back up, system load looks cromulent and we're happily
building (again).

shane

On Wed, May 17, 2017 at 9:50 AM, shane knapp <[hidden email]> wrote:

> i'm going to need to perform a quick reboot on the jenkins master.  it
> looks like it's hung again.
>
> sorry about this!
>
> shane

Re: [build system] jenkins got itself wedged...

shane knapp
after another couple of restarts due to high load and system
unresponsiveness, i finally found what is the most likely culprit:

a typo in the jenkins config where the java heap size was configured.
instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
random and non-deterministic system hangs we've had over the past
couple of years.
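
To make the typo concrete, here is a minimal sketch in Python (not the real launcher) of how the two flags are read. The `java` launcher treats `-D<name>[=<value>]` as a system property definition, so `-Dmx16G` only defines a property named `mx16G` and the maximum heap stays at the JVM default. The helper names and the `-Djava.awt.headless=true` flag are illustrative assumptions, not taken from the actual jenkins config, and the parser handles only the k/m/g size suffixes.

```python
import re

# Size suffixes accepted by this sketch (an assumption: only k/m/g are handled).
UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def classify_jvm_flag(flag):
    """Classify a single JVM flag roughly the way the java launcher reads it."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match:
        size = int(match.group(1)) * UNITS[match.group(2).lower()]
        return ("max_heap_bytes", size)
    if flag.startswith("-D"):
        # -Dname=value defines a system property; with no '=', the value is empty.
        name, _, value = flag[2:].partition("=")
        return ("system_property", (name, value))
    return ("other", flag)


def max_heap_bytes(flags):
    """Return the last explicit -Xmx value in bytes, or None if none is given."""
    result = None
    for flag in flags:
        kind, value = classify_jvm_flag(flag)
        if kind == "max_heap_bytes":
            result = value
    return result


# Hypothetical argument lists, before and after the fix.
broken = ["-Djava.awt.headless=true", "-Dmx16G"]
fixed = ["-Djava.awt.headless=true", "-Xmx16g"]

print(classify_jvm_flag("-Dmx16G"))  # ('system_property', ('mx16G', ''))
print(max_heap_bytes(broken))        # None
print(max_heap_bytes(fixed))         # 17179869184
```

Because the launcher accepts the mistyped flag without complaint, a check like this (or inspecting the running JVM's reported maximum heap) is one way to confirm the setting actually took effect.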

anyways, it's been corrected and the master seems to be humming along,
for real this time, w/o issue.  i'll continue to keep an eye on this
for the rest of the week, but things are looking MUCH better now.

sorry again for the interruptions in service.

shane

On Wed, May 17, 2017 at 9:59 AM, shane knapp <[hidden email]> wrote:

> ok, we're back up, system load looks cromulent and we're happily
> building (again).
>
> shane

Re: [build system] jenkins got itself wedged...

Sean Owen
I'm not sure if it's related, but I still can't get Jenkins to test PRs. For example, triggering it through the spark-prs.appspot.com UI gives me...

https://spark-prs.appspot.com/trigger-jenkins/18012

Internal Server Error

That might be from the appspot app though?

But posting "Jenkins test this please" on PRs doesn't seem to work, and I can't reach Jenkins:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

On Thu, May 18, 2017 at 12:44 AM shane knapp <[hidden email]> wrote:
> after another couple of restarts due to high load and system
> unresponsiveness, i finally found what is the most likely culprit:
>
> a typo in the jenkins config where the java heap size was configured.
> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
> random and non-deterministic system hangs we've had over the past
> couple of years.
>
> anyways, it's been corrected and the master seems to be humming along,
> for real this time, w/o issue.  i'll continue to keep an eye on this
> for the rest of the week, but things are looking MUCH better now.
>
> sorry again for the interruptions in service.
>
> shane


Re: [build system] jenkins got itself wedged...

shane knapp
yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
getting some error messages in the logs...   looks like jenkins is
thrashing on GC.
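
For readers unfamiliar with the term: GC thrashing means the JVM spends most of its wall-clock time collecting garbage and very little doing real work. A rough way to quantify it is the fraction of a time window spent in GC pauses. The sketch below is in Python; the pause figures are made up for illustration, and the 0.5 threshold is an arbitrary assumption, not a JVM or jenkins setting.

```python
def gc_overhead(pauses_ms, window_ms):
    """Fraction of a wall-clock window spent inside GC pauses."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return sum(pauses_ms) / window_ms


def looks_like_thrashing(pauses_ms, window_ms, threshold=0.5):
    """True when GC pauses take up at least `threshold` of the window."""
    return gc_overhead(pauses_ms, window_ms) >= threshold


# Hypothetical one-minute windows (60,000 ms) of pause durations in ms.
healthy = [40, 55, 30]   # a few short pauses
wedged = [900] * 60      # a 0.9 s pause every second

print(looks_like_thrashing(healthy, 60_000))  # False
print(looks_like_thrashing(wedged, 60_000))   # True
```

In practice the pause durations would come from the JVM's GC logs; how those are collected is left out here.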

now that i know what's up, i should be able to get this sorted today.

On Thu, May 18, 2017 at 12:39 AM, Sean Owen <[hidden email]> wrote:

> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
> example, triggering it through the spark-prs.appspot.com UI gives me...
>
> https://spark-prs.appspot.com/trigger-jenkins/18012
>
> Internal Server Error
>
> That might be from the appspot app though?
>
> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
> can't reach Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

Re: [build system] jenkins got itself wedged...

shane knapp
ok, more updates:

1) i audited all of the builds, and found that the spark-*-compile-*
and spark-*-test-* jobs were set to the identical cron time trigger,
so josh rosen and i updated them to run at H/5 (instead of */5).  load
balancing ftw.

2) the jenkins master is now running on java8, which has moar bettar
GC management under the hood.
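
The H/5 change in item 1 can be sketched as follows. In a Jenkins cron spec, `H` stands for a value derived from a hash of the job name, so each job gets its own stable offset, whereas `*/5` fires every job at :00, :05, :10 and so on. This Python sketch uses MD5 as a stand-in hash, which is an assumption and not Jenkins' actual hashing, and the job names are hypothetical examples of the spark-*-compile-* and spark-*-test-* patterns.

```python
import hashlib


def hashed_offset(job_name, period):
    """Map a job name to a stable offset in [0, period)."""
    digest = hashlib.md5(job_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % period


def minutes_for(spec, job_name):
    """Minutes of the hour at which a '*/N' or 'H/N' minute field fires."""
    kind, _, step = spec.partition("/")
    period = int(step)
    # '*' always starts at minute 0; 'H' starts at a per-job hashed offset.
    start = 0 if kind == "*" else hashed_offset(job_name, period)
    return list(range(start, 60, period))


jobs = ["spark-master-compile-maven", "spark-master-test-sbt"]  # hypothetical

for job in jobs:
    # With */5 every job shares the same minutes: 0, 5, 10, ..., 55.
    print(job, "*/5", minutes_for("*/5", job))
    # With H/5 each job still runs every 5 minutes, but from its own offset.
    print(job, "H/5", minutes_for("H/5", job))
```

Each job still runs twelve times an hour; the hashing only changes which minute the cycle starts on, so jobs that share a schedule no longer all start in the same instant.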

i'll be keeping an eye on this today, and if we start seeing GC
overhead failures, i'll start doing more GC performance tuning.
thankfully, cloudbees has a relatively decent guide that i'll be
following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/

shane

On Thu, May 18, 2017 at 8:39 AM, shane knapp <[hidden email]> wrote:

> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
> getting some error messages in the logs...   looks like jenkins is
> thrashing on GC.
>
> now that i know what's up, i should be able to get this sorted today.

Re: [build system] jenkins got itself wedged...

shane knapp
after needing another restart this afternoon, i did some homework and
aggressively twiddled some GC settings[1].  since then, things have
definitely smoothed out w/regards to memory and cpu usage spikes.

i've attached a screenshot of slightly happier looking graphs.

still keeping an eye on things, and hoping that i can go back to being
a lurker...  ;)

shane

1 - https://jenkins.io/blog/2016/11/21/gc-tuning/

On Thu, May 18, 2017 at 11:20 AM, shane knapp <[hidden email]> wrote:

> ok, more updates:
>
> 1) i audited all of the builds, and found that the spark-*-compile-*
> and spark-*-test-* jobs were set to the identical cron time trigger,
> so josh rosen and i updated them to run at H/5 (instead of */5).  load
> balancing ftw.
>
> 2) the jenkins master is now running on java8, which has moar bettar
> GC management under the hood.
>
> i'll be keeping an eye on this today, and if we start seeing GC
> overhead failures, i'll start doing more GC performance tuning.
> thankfully, cloudbees has a relatively decent guide that i'll be
> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>
> shane

Attachments: Screen Shot 2017-05-18 at 6.39.27 PM.png (529K), Screen Shot 2017-05-18 at 6.39.03 PM.png (846K)

Re: [build system] jenkins got itself wedged...

shane knapp
this is hopefully my final email on the subject...   :)

things seem to have settled down after my GC tuning, and system
load/cpu usage/memory has been nice and flat all night.  i'll continue
to keep an eye on things but it looks like we've weathered the worst
part of the storm.

On Thu, May 18, 2017 at 6:40 PM, shane knapp <[hidden email]> wrote:

> after needing another restart this afternoon, i did some homework and
> aggressively twiddled some GC settings[1].  since then, things have
> definitely smoothed out w/regards to memory and cpu usage spikes.
>
> i've attached a screenshot of slightly happier looking graphs.
>
> still keeping an eye on things, and hoping that i can go back to being
> a lurker...  ;)
>
> shane
>
> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/

Attachment: Screen Shot 2017-05-19 at 8.28.53 AM.png (761K)

Re: [build system] jenkins got itself wedged...

shane knapp
last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp <[hidden email]> wrote:

> this is hopefully my final email on the subject...   :)
>
> things have seemed to settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.

Re: [build system] jenkins got itself wedged...

Kazuaki Ishizaki
It has looked fine for the past few days. However, it seems to be slowly going down again...

When I tried to view a console log (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull), the server returned a "proxy error."

Regards,
Kazuaki Ishizaki



From: shane knapp <[hidden email]>
To: Sean Owen <[hidden email]>
Cc: dev <[hidden email]>
Date: 2017/05/20 09:43
Subject: Re: [build system] jenkins got itself wedged...




last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp <[hidden email]> wrote:

> this is hopefully my final email on the subject...   :)
>
> things have seemed to settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp <[hidden email]> wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 -
https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <[hidden email]> wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  
https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <[hidden email]> wrote:
>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>> getting some error messages in the logs...   looks like jenkins is
>>>> thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <[hidden email]> wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
>>>>> example, triggering it through the spark-prs.appspot.com UI gives me...
>>>>>
>>>>>
https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>>>>> can't reach Jenkins:
>>>>>
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <[hidden email]> wrote:
>>>>>>
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found what is the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>> couple of years.
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <[hidden email]> wrote:
>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>> > building (again).
>>>>>> >
>>>>>> > shane
>>>>>> >
>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp <[hidden email]>
>>>>>> > wrote:
>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>>>>>> >> looks like it's hung again.
>>>>>> >>
>>>>>> >> sorry about this!
>>>>>> >>
>>>>>> >> shane
>>>>>> >>
>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp <[hidden email]>
>>>>>> >> wrote:
>>>>>> >>> ...but just now i started getting alerts on system load, which was
>>>>>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>>>>>> >>> master and possible need to reboot.
>>>>>> >>>
>>>>>> >>> sorry about the interruption of service...
>>>>>> >>>
>>>>>> >>> shane
>>>>>> >>>
>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp <[hidden email]>
>>>>>> >>> wrote:
>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>>>>
>>>>>>
>>>>>
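
[editor's note] On the -Xmx16g versus -Dmx16G typo quoted above: -D only defines a Java system property (here one named "mx16G"), so the JVM never sees a heap limit and falls back to its default maximum heap size. Below is a minimal sketch of a check that would flag this kind of lookalike. The function name and the sample option strings are made up for illustration; they are not taken from the actual jenkins config.

```python
import re
import shlex

# A real max-heap flag: -Xmx followed by a size such as 16g, 512m or 1048576
XMX = re.compile(r"^-Xmx\d+[kKmMgG]?$")
# Same shape but spelled with -D, which only defines a system property
LOOKALIKE = re.compile(r"^-D(mx|ms|ss)\d+[kKmMgG]?$")


def audit_jvm_args(args):
    """Return a list of warnings for a JVM option string (hypothetical helper)."""
    tokens = shlex.split(args)
    warnings = []
    for tok in tokens:
        if LOOKALIKE.match(tok):
            suggestion = "-X" + tok[2:]
            warnings.append(
                f"{tok} only sets a system property; did you mean {suggestion}?"
            )
    if not any(XMX.match(tok) for tok in tokens):
        warnings.append("no -Xmx found; the JVM will use its default maximum heap size")
    return warnings


# Sample option strings, assumed for illustration only
for opts in ("-Djava.awt.headless=true -Dmx16G", "-Djava.awt.headless=true -Xmx16g"):
    print(opts, "->", audit_jvm_args(opts) or "ok")
```

Ordinary -Dname=value properties pass through untouched; only a -D option that looks exactly like a memory size is reported.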


Re: [build system] jenkins got itself wedged...

shane knapp
yeah.  i noticed that and restarted it a few minutes ago.  i'll have
some time later this afternoon to take a closer look...   :\

On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki <[hidden email]> wrote:

> It has looked fine for the past few days. However, it seems to be slowly going down again...
>
> When I tried to view a console log (e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
> the server returned "proxy error."
>
> Regards,
> Kazuaki Ishizaki
>
>
>
> From:        shane knapp <[hidden email]>
> To:        Sean Owen <[hidden email]>
> Cc:        dev <[hidden email]>
> Date:        2017/05/20 09:43
> Subject:        Re: [build system] jenkins got itself wedged...
> ________________________________
>
>
>
> last update of the week:
>
> things are looking great...  we're GCing happily and staying well
> within our memory limits.
>
> i'm going to do one more restart after the two pull request builds
> finish to re-enable backups, and call it a weekend.  :)
>
> shane
>
> On Fri, May 19, 2017 at 8:29 AM, shane knapp <[hidden email]> wrote:
>> this is hopefully my final email on the subject...   :)
>>
>> things seem to have settled down after my GC tuning, and system
>> load/cpu usage/memory has been nice and flat all night.  i'll continue
>> to keep an eye on things but it looks like we've weathered the worst
>> part of the storm.
>>
>> On Thu, May 18, 2017 at 6:40 PM, shane knapp <[hidden email]> wrote:
>>> after needing another restart this afternoon, i did some homework and
>>> aggressively twiddled some GC settings[1].  since then, things have
>>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>>
>>> i've attached a screenshot of slightly happier looking graphs.
>>>
>>> still keeping an eye on things, and hoping that i can go back to being
>>> a lurker...  ;)
>>>
>>> shane
>>>
>>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <[hidden email]>
>>> wrote:
>>>> ok, more updates:
>>>>
>>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>>> balancing ftw.
>>>>
>>>> 2) the jenkins master is now running on java8, which has moar bettar
>>>> GC management under the hood.
>>>>
>>>> i'll be keeping an eye on this today, and if we start seeing GC
>>>> overhead failures, i'll start doing more GC performance tuning.
>>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>>
>>>> shane
>>>>
>
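
[editor's note] On the H/5 versus */5 change in the quoted update: with */5 every job fires on the same minutes (0, 5, 10, ...), while H/5 keeps the five-minute interval but gives each job its own offset derived from a hash of the job name, so the triggers spread out. A rough sketch of the idea follows. The job names are hypothetical, and crc32 is only a stand-in; it is not the hash jenkins itself uses.

```python
import zlib


def star_minutes(step):
    """Minutes matched by '*/step': identical for every job."""
    return list(range(0, 60, step))


def hashed_minutes(job_name, step):
    """Minutes matched by an 'H/step'-style field.

    The offset comes from a stable hash of the job name. crc32 is a
    stand-in here; it is not the hash Jenkins itself uses.
    """
    offset = zlib.crc32(job_name.encode("utf-8")) % step
    return list(range(offset, 60, step))


# Hypothetical job names following the spark-*-compile-* / spark-*-test-* pattern
jobs = ["spark-master-compile-maven", "spark-master-test-sbt"]
for job in jobs:
    print(job, "*/5 ->", star_minutes(5)[:3], " H/5 ->", hashed_minutes(job, 5)[:3])
```

The offset is stable for a given job name, so each job still runs every five minutes, just not necessarily on the same minute as its neighbours.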


Re: [build system] jenkins got itself wedged...

shane knapp
working on it.  we'll have intermittent downtime for the next ~30 mins.

On Sun, May 21, 2017 at 12:01 PM, shane knapp <[hidden email]> wrote:

> yeah.  i noticed that and restarted it a few minutes ago.  i'll have
> some time later this afternoon to take a closer look...   :\
>
> On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki <[hidden email]> wrote:
>> It has looked fine for the past few days. However, it seems to be slowly going down again...
>>
>> When I tried to view a console log (e.g.
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
>> the server returned "proxy error."
>>
>> Regards,
>> Kazuaki Ishizaki
>>
>>
>>
