[build system] IMPORTANT: builds will be impacted this month


shane knapp ☠
TL;DR:  our build system is ancient, EOLed and about to get hit hard w/a secops hammer.  we need to literally reinstall the entire cluster from scratch and get things working.

here are the high level bullet points about what's coming up in the next month:

** all amp-jenkins-worker-* nodes are running centos 6, and the remaining nodes ubuntu 16.  all of them will be upgraded to ubuntu 20.

i will be doing this in stages so as to minimize downtime.
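
as a rough illustration of what "in stages" could look like (a sketch only, not the actual rollout plan): the python below splits a list of workers into batches so that only a fixed number are offline at once.  the node names and the batch size are made-up assumptions.

```python
def plan_stages(nodes, max_offline):
    """Split nodes into consecutive batches of at most max_offline nodes."""
    if max_offline < 1:
        raise ValueError("max_offline must be at least 1")
    return [nodes[i:i + max_offline] for i in range(0, len(nodes), max_offline)]


if __name__ == "__main__":
    # hypothetical node names following the amp-jenkins-worker-* pattern
    workers = [f"amp-jenkins-worker-{n:02d}" for n in range(1, 7)]
    for stage, batch in enumerate(plan_stages(workers, max_offline=2), start=1):
        print(f"stage {stage}: reinstall {', '.join(batch)}")
```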

** ALL BUILDS NEED TO BE PORTED TO UBUNTU 20.  i can ensure that the environments on the nodes are identical, but i have yet to successfully build any SBT jobs on any version of ubuntu, and the MVN builds won't run on ubuntu 18 (tho they work fine on 16).  i've also had difficulty getting the PRB (pull request builder) job to finish successfully on ubuntu.

for this, i will definitely need help from the dev community to get things working...  and the speed at which things are fixed will be directly proportional to how much help i get.  :)

** amplab jenkins primary node will need two major upgrades:  OS from centos 6 to ubuntu 20, and jenkins from 1.6 to 2.X LTS... 

i'm most concerned about this, as it is literally the exact same jenkins installation that patrick wendell set up over 10 years ago.  there are many publish secrets entered into the jenkins config and i really hope that we don't lose them.

my plan here is to upgrade the current jenkins and fix anything that breaks.  then we'll rsync jenkins' homedir to the new primary node and hope that works.  :)
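
as a rough sketch of that copy step (not the actual procedure): the python below just builds an rsync command line for mirroring the jenkins home directory to a new host.  the hostname and paths are made-up placeholders, and it defaults to a dry run so nothing gets copied until that flag is turned off.

```python
import shlex


def build_rsync_command(src_home, dest_host, dest_home, dry_run=True):
    """Build an rsync argv that mirrors a jenkins home dir to another host.

    All hostnames and paths passed in are placeholders for illustration.
    """
    cmd = ["rsync", "--archive", "--hard-links", "--numeric-ids"]
    if dry_run:
        # --dry-run only reports what would be transferred
        cmd.append("--dry-run")
    # a trailing slash on the source copies the directory's contents
    cmd.append(src_home.rstrip("/") + "/")
    cmd.append(f"{dest_host}:{dest_home.rstrip('/')}/")
    return cmd


if __name__ == "__main__":
    # hypothetical host and paths, not the real amplab values
    argv = build_rsync_command(
        "/var/lib/jenkins", "new-primary.example.org", "/var/lib/jenkins"
    )
    print(shlex.join(argv))
```

checking the dry-run output before the real copy is one cheap way to see what would move, which matters here since the secrets live in that directory tree.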

** user audits

UC berkeley's new security standards require quarterly audits of non-affiliated accounts...  this will impact only a few people on this list, but i'll need to work w/campus and our department on solutions for this other than local accounts on the servers.

a LOT is going to happen, and i'm meeting w/my team today and will come up w/a basic plan.  we will definitely experience downtime during this, but i cannot guess as to what that will look like.

this might also be a good time to talk about the future of the build system, auditing our builds (do we need SBT?), or even finally getting around to dockerizing everything  so i don't need such a fragile and non-atomic set of worker nodes specifically for spark.

thoughts?  comments?

shane
 
ps -- this is one of the reasons why i haven't been around much lately...  it's been really tough keeping things up to date while trying to remotely train up one of my sysadmins to take over some of my build system duties.
--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: [build system] IMPORTANT: builds will be impacted this month

Dongjoon Hyun-2
Thank you for sharing the plan, Shane. It's great!

I'll participate actively if there is any issue during migration.

To share the current status on the new target OS (Ubuntu 20): the Apache Spark master branch has already been migrated to Ubuntu 20.04 in the GitHub Actions environment.

    [SPARK-33156][INFRA] Upgrade GithubAction image from 18.04 to 20.04
    [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs
    [SPARK-33239][INFRA] Use pre-built image at GitHub Action SparkR job

For PySpark/SparkR testing, we may be able to take advantage of the pre-built image in the Jenkins environment too, because it isolates package installation from testing and so removes the installation flakiness issues.
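
As a minimal sketch of that idea (an illustration, not the actual Jenkins job configuration), the Python below assembles a `docker run` command that mounts a Spark checkout into a pre-built image and runs a test command inside it. The image name, workspace path, mount point, and test command are all hypothetical placeholders.

```python
import shlex


def build_docker_test_command(image, workspace, test_command, mount_point="/workspace"):
    """Build a `docker run` argv that runs test_command inside a pre-built image.

    The checkout at `workspace` is bind-mounted at `mount_point`, so the image
    only needs to carry the already-installed dependencies.
    """
    return [
        "docker", "run", "--rm",
        "--volume", f"{workspace}:{mount_point}",
        "--workdir", mount_point,
        image,
    ] + list(test_command)


if __name__ == "__main__":
    # Hypothetical image name, workspace path, and test command.
    argv = build_docker_test_command(
        "example.org/spark-test-image:latest",
        "/home/jenkins/workspace/spark",
        ["./python/run-tests"],
    )
    print(shlex.join(argv))
```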

Also, in order to prepare the migration of `branch-3.0/branch-2.4`, we can backport the above patches (SPARK-33156/SPARK-33162/SPARK-33239) to `branch-3.0/branch-2.4`.

Bests,
Dongjoon.


On Mon, Nov 2, 2020 at 1:16 PM shane knapp ☠ <[hidden email]> wrote:
> [...]