please read: current state and the future of the apache spark build system

shane knapp ☠
this will be a relatively big update, as there are many many moving pieces with short, medium and long term goals.

TLDR1:   we're shutting jenkins down at the end of 2021.

TLDR2:  i know we're way behind on pretty much everything.  most of the hardware is at or beyond EOL, and systemic build failures (like k8s/minikube) keep popping up at random.  i've had to restrict access due to new campus policies; i'll be dealing with that shortly, but only for a few contributors.

long term (until EOY):
* decide what the future of spark builds and releases will look like
  - do we need jenkins?
  - if we do, who's responsible for hosting + ops?
* we will permanently shut down amplab jenkins by the end of 2021
  - uc berkeley has funded this for over 10 years, and both the funds and staff (only me, for 7 years) are going away.  i'm staying at cal, but have a much different job now.  :)

medium term (in 6 months):
* prepare jenkins worker ansible configs and stick in the spark repo
  - nothing fancy, but enough to config ubuntu workers
  - could be used to create docker containers for testing in <wavey-hands>THE CLOUD</wavey-hands>
* train up brian shiratsuki (cced) to help w/ops tasks and upgrades over the next ~6m
* get to all of the python version, library installation, etc etc jira requests

short term (weeks):
* debug and figure out why minikube/k8s broke
  - i really could use some help here...
* bring up additional workers
  - finish hardware/system level repairs on the bare metal
  - see above, re k8s jira
* stabilize cluster
  - recent jenkins LTS upgrade broke the web GUI
  - finish deploying monitoring/alerting
  - this hardware is OLD and falling over, so we have lots of random disk and ram failures.  it's whack-a-mole, and each trip to the colo for repairs takes a full day

i'm only able to spend a few hours a week on the build system, so expect random downtime, reboots, restarts, and testing.  we're testing new nodes as we deploy, and hoping to fix anything before releasing them into the wild, but some things might be flaky.

but the biggest question is what you all need w/regards to build infrastructure...  and who's going to be responsible for it.

thanks for reading!  :)

shane
--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: please read: current state and the future of the apache spark build system

shane knapp ☠
> medium term (in 6 months):
> * prepare jenkins worker ansible configs and stick in the spark repo
>   - nothing fancy, but enough to config ubuntu workers
>   - could be used to create docker containers for testing in <wavey-hands>THE CLOUD</wavey-hands>

fwiw, i just decided to bang this out today:
https://github.com/apache/spark/pull/32178

shane
--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: please read: current state and the future of the apache spark build system

Hyukjin Kwon
Thanks Shane!!


Re: please read: current state and the future of the apache spark build system

Holden Karau
Thanks Shane for keeping the build infrastructure running for all of
these years :)

I've got some Kubernetes infra on AS399306 down in HE in Fremont. It's
also perhaps not of the newest variety, but so far no disk failures or
anything like that (knock on wood of course). The catch is it's on a
15 amp circuit and frankly I'm still learning how BGP works.

Maybe we could experiment with
https://github.com/lazybit-ch/actions-runner/tree/master/actions-runner
and try nested MiniKube (which I know is... not great but might make
things more portable)?

Would the community (and/or some of our corporate contributors) be
open to contributing some hardware + power money or cloud credits?

On Wed, Apr 14, 2021 at 5:13 PM Hyukjin Kwon <[hidden email]> wrote:

>
> Thanks Shane!!
>
> On Thu, 15 Apr 2021, 09:03 shane knapp ☠, <[hidden email]> wrote:
>>>
>>> medium term (in 6 months):
>>> * prepare jenkins worker ansible configs and stick in the spark repo
>>>   - nothing fancy, but enough to config ubuntu workers
>>>   - could be used to create docker containers for testing in <wavey-hands>THE CLOUD</wavey-hands>
>>>
>> fwiw, i just decided to bang this out today:
>> https://github.com/apache/spark/pull/32178
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu



--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau



Re: please read: current state and the future of the apache spark build system

Yikun Jiang
Many thanks for your work on infra, @Shane. In particular, we (@huangtianhua and I) got a lot of help from you when making the Arm CI work. [1]

> prepare jenkins worker ansible configs and stick in the spark repo

I took a quick glance at https://github.com/apache/spark/pull/32178, and it doesn't seem to contain any code for Arm node setup and configuration.

Do you plan to update the existing code to cover Arm node setup and configuration? Even just some existing scripts would be fine.

Do you have any specific plan for the Arm node migration? If needed, I can help with the Arm-related node setup and config in the new infra to make sure the Spark Arm CI keeps working.

BTW, we are also considering moving the Arm build from Jenkins to GitHub Actions (self-hosted or cloud-deployed, see https://github.com/actions/starter-workflows/tree/main/ci). Our team has already done some preliminary work; see the PoC in [2] (cc @mgrigorov). Maybe it could bring some ideas for the future infrastructure.

[1] https://amplab.cs.berkeley.edu/jenkins/label/spark-arm/

Holden Karau <[hidden email]> wrote on Thu, Apr 15, 2021 at 8:29 AM:
> Thanks Shane for keeping the build infrastructure running for all of
> these years :)
>
> I've got some Kubernetes infra on AS399306 down in HE in Fremont. It's
> also perhaps not of the newest variety, but so far no disk failures or
> anything like that (knock on wood of course). The catch is it's on a
> 15 amp circuit and frankly I'm still learning how BGP works.
>
> Maybe we could experiment with
> https://github.com/lazybit-ch/actions-runner/tree/master/actions-runner
> and try nested MiniKube (which I know is... not great but might make
> things more portable)?
>
> Would the community (and/or some of our corporate contributors) be
> open to contributing some hardware + power money or cloud credits?
