[build system] IMPORTANT UPDATE

shane knapp ☠
this is a lengthy, but important read for everyone here.

in the next few days, the remaining centos machines (PRB/SBT workers AND primary) will be reimaged from centos 6.9 to ubuntu 20.04 LTS.

this means three important things on the very near horizon: 
1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
2 -- jenkins itself will be down for a while as we move the jenkins installation to its new home.
3 -- those of you with accounts here will temporarily lose access

regarding (1), brian (cced) will be helping me debug and fix any system-level bugs (python envs, missing packages, etc).  jon (cced) will be doing the reimaging and cobbling together of hardware to keep us on our feet.  their help is going to be invaluable to getting us back on the ground.

we already have two ubuntu 20 workers up and building (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build is already green.  i'll keep an eye on these workers to ensure i didn't miss anything.

once we have a couple more ubuntu 20 machines up, i'll move the PRB and SBT builds there and let them fail as often as possible so we can use the build logs during the migration of the primary.

then we shut down jenkins and move to the new primary.

this will all be happening in the next week to week-and-a-half.

nearish on the horizon, we need to do three things:
1 -- reimage the ubuntu 16 workers
2 -- clean up all of the breakages within the jenkins plugin universe.  there are a lot of stack traces everywhere after the upgrade, but things are still building so i'm inclined to push this out.
3 -- fix the PRB/SBT builds.

further off, once we're stable, we (the spark community) will need to have an honest conversation about where the build system lives.  we don't currently have enough resources here to manage the system in the way it deserves, and i can't foresee getting the staffing for long-term support any time soon.

however, with the ansible configs (which i plan on moving to the spark repo), it should be much easier to replicate the build system.

by this time next year, i would like to have helped find the build system a new home, and sunset jenkins.  over the past 11 years (i think), this system has built spark.  it's getting a little tired and needs a well-deserved break.  :)

shane
--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: [build system] IMPORTANT UPDATE

shane knapp ☠
due to scheduling, upcoming holiday and in-the-colo work requirements, all of the centos workers are being wiped NOW.

this is great, as the sooner we can get started on fixing builds the better.  i'm not going anywhere over the holiday, so i'll get a good head start on things.

thank you jon!

shane

On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ <[hidden email]> wrote:

Re: [build system] IMPORTANT UPDATE

shane knapp ☠
our very first ubuntu-based PRB is running:

crossing my fingers!  :)

On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠ <[hidden email]> wrote:

Re: [build system] IMPORTANT UPDATE

shane knapp ☠
all spark builds have been ported and triggered:

not shown are the regular and k8s PRB, which are also running.

i think i've nailed down most of the stupid PATH and JAVA_HOME issues, but i'm sure we'll have some stuff to work out.  i'm mostly keeping an eye on the build history of research-jenkins-worker-01 and -02, as they're running the latest OS + ansible (which will be moved into the spark repo asap).
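
for anyone chasing similar PATH and JAVA_HOME problems on a freshly imaged worker, here is a minimal sketch of a sanity check.  it is an illustration only, not the check used on these workers:  the function name check_java_env and the three rules it applies (JAVA_HOME is set, it contains bin/java, and its bin directory is on PATH) are assumptions.

```python
import os
from pathlib import Path


def check_java_env(env):
    """Return a list of problems found in env, a dict of environment variables.

    Hypothetical helper: the rules below are assumptions about what a
    healthy build worker looks like, not a description of the real setup.
    """
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems

    java_bin = Path(java_home) / "bin" / "java"
    if not java_bin.is_file():
        problems.append(f"JAVA_HOME does not contain bin/java: {java_home}")
    elif str(java_bin.parent) not in env.get("PATH", "").split(os.pathsep):
        problems.append("JAVA_HOME/bin is not on PATH")
    return problems


if __name__ == "__main__":
    # Check the environment this script was started with.
    for problem in check_java_env(dict(os.environ)):
        print(problem)
```

run as part of a build step, it would print one line per problem found, which may make environment-related failures easier to tell apart from real test failures in the build logs.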

i'm still concerned about sbt failures, which include the PRB.  we'll see how things go, and just focus on getting things working on ubuntu 20 LTS.  if we need to drop the ubuntu 16 workers from the pool temporarily, i would be more than happy to do that.  we'll lose some capacity, but it looks like we have a solid template for getting these suckers redeployed so turn-around should be pretty quick.

we also need to dedicate some time to clean up/fix our plugin configs.  there's been a lot of change over the past three years and things like PRB triggers seem flaky (it took 28m instead of 5m for this job to trigger:  https://github.com/apache/spark/pull/29994)

all this being said, i'm really happy w/our progress so far and have started leaning towards 'cautiously optimistic'...  we'll see how things go and recalibrate accordingly.  i'll have a better idea of where we are tomorrow and keep the list updated.

and finally:  a HUGE thanks goes out to jon for the work going on at the colo this moment:  rack rearrangement, cleaning up networking, fixing hardware, reimaging and generally kicking ass!

have a great holiday!

shane

On Tue, Nov 24, 2020 at 2:24 PM shane knapp ☠ <[hidden email]> wrote:

Re: [build system] IMPORTANT UPDATE

shane knapp ☠
hey all, work is going quite well and smoothly for this project.

today's update:

we will experience significant downtime monday/tuesday as we spin up the new primary jenkins node.  until then, we'll be building over the next few days so i'll have a chance to better track down and fix any system-level build breaks.

but most importantly, i just added 3 of the 4 new ubuntu 20.04 workers to the pool:  research-jenkins-worker-03, 04 and 06.  -05 is being difficult, so i'm going to let it pout in the corner for a while before hitting it again w/the ansible cannon.

shane

On Tue, Nov 24, 2020 at 6:08 PM shane knapp ☠ <[hidden email]> wrote:

Re: [build system] IMPORTANT UPDATE

shane knapp ☠
alright, builds are looking solid except for SBT...  if someone here could take a look at those failures i'd be most appreciative.

the important ones:  PRB, PRB-K8s, k8s, snapshot and maven builds are all green!

i'm literally gobsmacked by how smoothly this went.  :)

we're all going to enjoy a mellow holiday and i'll check build statuses every now and then and see if i find anything else like this:

have a great holiday everyone!  we'll start getting the new primary set up on monday, and hopefully by tuesday be fully up and running.

shane


On Wed, Nov 25, 2020 at 1:35 PM shane knapp ☠ <[hidden email]> wrote:

Re: [build system] IMPORTANT UPDATE

Hyukjin Kwon
Thanks Shane.

On Thu, 26 Nov 2020, 10:19 shane knapp ☠, <[hidden email]> wrote: