restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
i wasn't able to get to it today, so i'm hoping to squeeze in a quick trip to the colo tomorrow morning.  if not, then first thing thursday.

--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
this will be happening tomorrow...  today is Meeting Hell Day[tm].

On Tue, Jul 7, 2020 at 1:59 PM shane knapp ☠ <[hidden email]> wrote:
i wasn't able to get to it today, so i'm hoping to squeeze in a quick trip to the colo tomorrow morning.  if not, then first thing thursday.


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

Hyukjin Kwon
Thanks Shane!

BTW, it's getting serious, e.g. https://github.com/apache/spark/pull/28969.
The tests have not been able to pass for 7 days. Hopefully restarting the machines will improve the current situation :-)

Separately, I am working on a PR to run the Spark tests in GitHub Actions.
Hopefully we can use GitHub Actions and Jenkins together in the meantime.


On Thu, Jul 9, 2020 at 1:07 AM shane knapp ☠ <[hidden email]> wrote:
this will be happening tomorrow...  today is Meeting Hell Day[tm].


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

Jungtaek Lim-2
As a side note, I've raised patches for addressing two frequent flaky tests, CliSuite [1] and HiveSessionImplSuite [2]. Hope this helps to mitigate the situation.


On Thu, Jul 9, 2020 at 11:51 AM Hyukjin Kwon <[hidden email]> wrote:
Thanks Shane!

BTW, it's getting serious, e.g. https://github.com/apache/spark/pull/28969.
The tests have not been able to pass for 7 days. Hopefully restarting the machines will improve the current situation :-)

Separately, I am working on a PR to run the Spark tests in GitHub Actions.
Hopefully we can use GitHub Actions and Jenkins together in the meantime.



Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
this is happening now.

On Wed, Jul 8, 2020 at 9:07 AM shane knapp ☠ <[hidden email]> wrote:
this will be happening tomorrow...  today is Meeting Hell Day[tm].


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

Dongjoon Hyun-2
Thank you always, Shane!

Bests,
Dongjoon.

On Thu, Jul 9, 2020 at 9:30 AM shane knapp ☠ <[hidden email]> wrote:
this is happening now.


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
ok, we're back up and building (just waiting for one worker, -06 to finish cleaning itself up).


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
aaaaand -06 is back!  i'll keep an eye on things today, but suffice to say on each worker i:

1) rebooted
2) cleaned ~/.ivy2, ~/.m2, and other associated caches

we should be g2g!  please reply here if you continue to see weirdness.
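
For anyone who wants to repeat step 2 on a machine of their own, the cache cleanup can be sketched roughly as follows. This is a minimal sketch, not the script that was actually run on the workers: the function name `clear_build_caches` is hypothetical, and only `~/.ivy2` and `~/.m2` come from the message above (the "other associated caches" are not named in the thread, so they are left out).

```python
import shutil
from pathlib import Path


def clear_build_caches(home, cache_names=(".ivy2", ".m2")):
    """Delete the named cache directories under `home`.

    Returns the list of paths that were actually removed, so a caller can
    log what happened. Directories that do not exist are skipped.
    """
    removed = []
    for name in cache_names:
        cache_dir = Path(home) / name
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)  # recursive delete of the whole cache
            removed.append(cache_dir)
    return removed
```

Calling `clear_build_caches(Path.home())` as the build user would wipe that user's Ivy and Maven caches, so the next build has to download every dependency again. That is the point of the exercise when a cache is suspected of being corrupt, but it also makes the first builds after the cleanup slower.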

On Thu, Jul 9, 2020 at 10:08 AM shane knapp ☠ <[hidden email]> wrote:
ok, we're back up and building (just waiting for one worker, -06 to finish cleaning itself up).


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

Hyukjin Kwon
Thank you Shane.

On Fri, Jul 10, 2020 at 2:35 AM shane knapp ☠ <[hidden email]> wrote:
aaaaand -06 is back!  i'll keep an eye on things today, but suffice to say on each worker i:

1) rebooted
2) cleaned ~/.ivy2, ~/.m2, and other associated caches

we should be g2g!  please reply here if you continue to see weirdness.


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
i'm seeing green PRB builds now, so i feel that we've gotten things building again!  :)

On Thu, Jul 9, 2020 at 5:33 PM Hyukjin Kwon <[hidden email]> wrote:
Thank you Shane.


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

ukby1234
Looks like Jenkins still isn't stable. My PR has failed twice in a row:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

Hyukjin Kwon
A couple of flaky tests can happen; that's normal. It seems to have gotten better now, at least. I will keep monitoring the builds.

On Fri, Jul 10, 2020 at 4:33 PM ukby1234 <[hidden email]> wrote:
Looks like Jenkins still isn't stable. My PR has failed twice in a row:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport




Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
yeah, i can't do much for flaky tests...  just flaky infrastructure.


On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon <[hidden email]> wrote:
A couple of flaky tests can happen; that's normal. It seems to have gotten better now, at least. I will keep monitoring the builds.


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
define "a lot" and provide some links to those builds, please.  there are roughly 2000 builds per day, and i can't do more than keep a cursory eye on things.

the infrastructure that the tests run on hasn't changed one bit on any of the workers, and 'kill -9' could be a timeout, flakiness caused by old build processes remaining on the workers after the master went down, or me trying to clean things up w/o a reboot.  or, perhaps, something wrong w/the infra.  :)  

On Fri, Jul 10, 2020 at 9:28 AM Frank Yin <[hidden email]> wrote:
Agreed, but I've seen a lot of builds killed by signal 9. I'm assuming that's infrastructure?

On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ <[hidden email]> wrote:
yeah, i can't do much for flaky tests...  just flaky infrastructure.



Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
only 125561, 125562 and 125564 were impacted by -9.

125565 exited w/a status of 143 (128 + 15), i.e. it was killed by signal 15 (SIGTERM), which means the process was terminated for unknown reasons.

125563 looks like mima failed due to a bunch of errors.

i just spot checked a bunch of recent failed PRB builds from today and they all seemed to be legit.

another thing that might be happening is an overload of PRB builds on the workers due to the backlog...  the workers are under a LOT of load right now, and i can put some rate limiting in to see if that helps out.

shane
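
The arithmetic in the message above follows the common shell convention that a process killed by signal N is reported with exit status 128 + N, so 143 maps to signal 15 (SIGTERM) and 137 maps to signal 9 (SIGKILL, the "-9" case). A minimal sketch of that decoding, assuming a POSIX system and a shell-style status in the range 0-255; the helper name `describe_exit_status` is hypothetical and not part of Jenkins or Spark:

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status (0-255) in words."""
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    if status == 0:
        return "exited successfully"
    return f"exited with error code {status}"
```

Note that the status only says which signal arrived, not who sent it: a SIGTERM can come from a build timeout, an operator, or a shutdown, so the console log is still needed to find the cause.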

On Fri, Jul 10, 2020 at 11:31 AM Frank Yin <[hidden email]> wrote:

On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ <[hidden email]> wrote:
define "a lot" and provide some links to those builds, please.  there are roughly 2000 builds per day, and i can't do more than keep a cursory eye on things.

the infrastructure that the tests run on hasn't changed one bit on any of the workers, and 'kill -9' could be a timeout, flakiness caused by old build processes remaining on the workers after the master went down, or me trying to clean things up w/o a reboot.  or, perhaps, something wrong w/the infra.  :)  


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

ukby1234
Yeah, that's what I figured -- those workers are under load. Thanks. 

On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ <[hidden email]> wrote:
only 125561, 125562 and 125564 were impacted by -9.

125565 exited w/a status of 143 (128 + 15), i.e. it was killed by signal 15 (SIGTERM), which means the process was terminated for unknown reasons.

125563 looks like mima failed due to a bunch of errors.

i just spot checked a bunch of recent failed PRB builds from today and they all seemed to be legit.

another thing that might be happening is an overload of PRB builds on the workers due to the backlog...  the workers are under a LOT of load right now, and i can put some rate limiting in to see if that helps out.

shane


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
no, 8 hours is plenty.  things will speed up soon once the backlog of builds works through....  i limited the number of PRB builds to 4 per worker, and things are looking better.  let's see how we look next week.
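
The per-worker cap described above can be pictured as a counting semaphore: a build has to take one of four slots before it starts and gives the slot back when it finishes. The sketch below only illustrates that idea in plain Python. It is not the Jenkins configuration that was changed (the thread does not show it), and `run_builds`, `MAX_BUILDS_PER_WORKER`, and the sleeping stand-in for a build are all hypothetical.

```python
import threading
import time

MAX_BUILDS_PER_WORKER = 4  # the limit mentioned in the message above


def run_builds(num_builds, limit=MAX_BUILDS_PER_WORKER, build_seconds=0.01):
    """Run `num_builds` fake builds, never more than `limit` at once.

    Returns the highest number of builds seen running at the same time.
    """
    slots = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    running = 0
    peak = 0

    def build():
        nonlocal running, peak
        with slots:  # blocks while `limit` builds are already running
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(build_seconds)  # stand-in for the real build work
            with lock:
                running -= 1

    threads = [threading.Thread(target=build) for _ in range(num_builds)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

With twelve queued builds, `run_builds(12)` never reports a peak above 4; the remaining builds simply wait, which is the trade being made here: a longer queue in exchange for less load per worker and fewer timeouts.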

On Fri, Jul 10, 2020 at 3:31 PM Frank Yin <[hidden email]> wrote:
Can we also increase the build timeout? 
This one fails because it times out, not because of test failures. 

On Fri, Jul 10, 2020 at 2:16 PM Frank Yin <[hidden email]> wrote:
Yeah, that's what I figured -- those workers are under load. Thanks. 


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

shane knapp ☠
alright, the system load graphs show that we've had a generally decreasing load since friday, and have burned through ~3k builds/day since the reboot last week!  i don't see many timeouts, and the PRB builds have been generally green for a couple of days.

again, i will keep an eye on things but i feel we're out of the woods right now.  :)

shane

On Fri, Jul 10, 2020 at 3:43 PM Frank Yin <[hidden email]> wrote:
Great. Thanks. 

On Fri, Jul 10, 2020 at 3:39 PM shane knapp ☠ <[hidden email]> wrote:
no, 8 hours is plenty.  things will speed up soon once the backlog of builds works through....  i limited the number of PRB builds to 4 per worker, and things are looking better.  let's see how we look next week.


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

Xiao Li-2
Thank you very much, Shane! 

Xiao

On Mon, Jul 13, 2020 at 10:15 AM shane knapp ☠ <[hidden email]> wrote:
alright, the system load graphs show that we've had a generally decreasing load since friday, and have burned through ~3k builds/day since the reboot last week!  i don't see many timeouts, and the PRB builds have been generally green for a couple of days.

again, i will keep an eye on things but i feel we're out of the woods right now.  :)

shane
