
Jenkins CI host: manage available disk space #2362

Closed
jbergstroem opened this issue Jun 20, 2020 · 23 comments

@jbergstroem
Member

Some background here.

Seeing how a month's worth of build artifacts (our Jenkins rotation for cleaning out old jobs) now exceeds our disk size, we have two options:

  1. Reduce build history, or
  2. Expand disk space

Seeing how disk space is "cheap", I prefer the second option. I think we should create another volume, twice the size, and attach it to /var/lib/jenkins. I need to revisit our Ansible status to see how to automate this, but we could do it manually and add the automation bits afterwards.
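A minimal sketch of what the manual swap might look like, assuming the new volume shows up as /dev/vdb (the device name, filesystem and temporary paths are placeholders, and the provider-side step of actually attaching the volume isn't shown):

```sh
# Stop Jenkins so the data directory is quiescent while we copy it
systemctl stop jenkins

# Format the new volume and mount it in a temporary location
mkfs.ext4 /dev/vdb                 # placeholder device name
mkdir -p /mnt/jenkins-new
mount /dev/vdb /mnt/jenkins-new

# Copy the existing Jenkins home across, preserving permissions, ACLs and xattrs
rsync -aHAX /var/lib/jenkins/ /mnt/jenkins-new/

# Swap the mount over to /var/lib/jenkins and make it persistent
umount /mnt/jenkins-new
echo '/dev/vdb /var/lib/jenkins ext4 defaults,nofail 0 2' >> /etc/fstab
mount /var/lib/jenkins

systemctl start jenkins
```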

Thoughts?

@AshCripps
Member

I think expanding is a good shout. Out of curiosity, have we ever had any monitoring on these machines? Particularly with the workspace machines, getting a warning that the disk is going to fill up before it does and things go boom could be useful.

@jbergstroem
Member Author

@AshCripps said:
I think expanding is a good shout. Out of curiosity, have we ever had any monitoring on these machines? Particularly with the workspace machines, getting a warning that the disk is going to fill up before it does and things go boom could be useful.

I'm a bit out of the loop, but Jenkins used to warn at 10%; someone still has to keep a steady eye on the monitors.
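As a rough illustration of the kind of standalone check that warns before the disk actually fills (the 90% threshold, mount point and mail address are placeholders, not what Jenkins itself uses):

```sh
#!/bin/sh
# Warn when the Jenkins data volume crosses a usage threshold (placeholder value).
THRESHOLD=90
USED=$(df --output=pcent /var/lib/jenkins | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
  echo "WARNING: /var/lib/jenkins is ${USED}% full" \
    | mail -s "CI disk warning" build-infra@example.org
fi
```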

@mhdawson
Member

I prefer expanding versus shortening the history

@jbergstroem
Member Author

I will create a new volume and mount it for Jenkins. I'll announce a service window for this shortly, due later this week, in a TZ that fits the cause. I expect it to be relatively quick, but still.

Since we don't control VM initialization in Ansible just yet, recreating the CI with a larger base disk would be enough to cover the changes I'm making (read: no need to add more Ansible playbooks).

@rvagg
Member

rvagg commented Jun 23, 2020

@AshCripps said:
Out of curiosity, have we ever had any monitoring on these machines?

Something much discussed but never quite pulled off. There were a couple of attempts to set up a monitoring system for all our machines, but nothing ever grew to maturity. You're welcome to try, and maybe being more limited in scope would be a way to actually get something working; I think the previous attempts tried to monitor ALL THE THINGS, which might speak to their failure to eventuate.

@AshCripps
Member

I think limiting the monitoring to just the infra machines and a few other key machines, like the workspace machines, is the best bet; we should have enough redundancy in the test machines that one falling over shouldn't break us. There are plenty of open source solutions with loads of integrations (for example, we could get alerts sent to our Slack channel, which should mirror to IRC; if the warning is early enough, people can get round to it when they can rather than it being an emergency).
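As a rough sketch of the Slack end of that (the webhook URL is a placeholder; IRC mirroring would come from whatever bridge the channel already uses):

```sh
# Post a disk-space alert to a Slack channel via an incoming webhook (placeholder URL).
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"ci.nodejs.org: /var/lib/jenkins is 92% full"}' \
  https://hooks.slack.com/services/T000/B000/XXXXXXXX
```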

@jbergstroem
Member Author

Opened/scheduled an issue in #2366 to handle this.

@jbergstroem
Member Author

jbergstroem commented Jun 25, 2020

@AshCripps during my Very Active build days (read: lots of free time) I started setting up an InfluxDB/Grafana stack, since Telegraf more or less runs everywhere. I don't know what you have in mind, but perhaps we can have a chat about it and see if there are quick wins?
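For reference, a hedged sketch of the sort of Telegraf snippet this would involve, using the standard disk input and InfluxDB output plugins (the InfluxDB URL, config path and mount points are assumptions, not our actual setup):

```sh
# Drop a minimal disk-metrics config into Telegraf's include directory (assumed path)
cat <<'EOF' | sudo tee /etc/telegraf/telegraf.d/disk.conf
[[inputs.disk]]
  mount_points = ["/", "/var/lib/jenkins"]      # watch the Jenkins data volume

[[outputs.influxdb]]
  urls = ["http://influxdb.example.org:8086"]   # placeholder InfluxDB endpoint
  database = "telegraf"
EOF

# Dry-run the disk input once to confirm it produces metrics before restarting the service
telegraf --test --input-filter disk
```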

@AshCripps
Member

@jbergstroem funnily enough, that was pretty much what I had in mind: something simple and widely used/supported.

@jbergstroem
Member Author

So, a few setbacks:

  • The datacenter we are in doesn't support volume storage
  • We cannot switch DC without changing IP, which may impact the Jenkins agents contract (needs to be confirmed); there is also a lot of firewall-related configuration tied to the agents

The other options available to us are:

  • Expand the current volume to the next size (640T instead of a 320T drive), which also is not risk-free
  • Lower how long we keep job history around, from 30 days to.. something lower
  • Set up a new CI from scratch and migrate

I'd like to suggest we lower the history by perhaps just a few days as a mitigation strategy while we decide whether to extend or redeploy.
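To size that up before touching anything, a hedged sketch (the paths and the 25-day cutoff are assumptions) that only measures how much space older builds occupy, without deleting anything:

```sh
# Sum the disk usage of build directories older than 25 days (measurement only, no deletion).
find /var/lib/jenkins/jobs -maxdepth 3 -type d -path '*/builds/*' -mtime +25 \
  -exec du -sk {} + | awk '{kb += $1} END {printf "%.1f GiB\n", kb / 1024 / 1024}'
```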

@jbergstroem
Member Author

I went through a few of the jobs and it looks to me like the job pruning script either isn't being run or doesn't work as expected. I haven't been involved in this part much as of late; I will wait for some confirmation about the current state before taking any potentially destructive actions (removal of jobs, resizing of disks, ..).

@rvagg
Member

rvagg commented Jun 29, 2020

@jbergstroem /var/spool/cron/crontabs/root does it weekly: /opt/local/bin/rsnapshot -c /opt/local/etc/rsnapshot.conf weekly && /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org

No guarantee that it's working properly of course! Might be worth running it manually and watching that it's doing the right thing? The regex might need updating too. 🤷
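One low-risk way to watch it would be to run the script under shell tracing (a generic suggestion, assuming the script is plain shell), so every command it issues is echoed as it runs:

```sh
# Trace the prune script for one host; -x prints each command as it executes,
# and tee keeps a copy of the output for later inspection.
bash -x /root/backup_scripts/remove_old.sh ci.nodejs.org 2>&1 | tee /tmp/remove_old.trace
```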

@rvagg
Member

rvagg commented Jun 29, 2020

that's on our backup server btw, infra-joyent-smartos15-x64-1

@jbergstroem
Member Author

jbergstroem commented Jun 29, 2020

@rvagg said:
that's on our backup server btw, infra-joyent-smartos15-x64-1

Ah, I was looking in the wrong place. The file (remove_old.sh) has local modifications; I'll have to read up on the changes.

So, the remove job itself seems to work fine (I changed the local 7-day setting to 21 for now). I'm guessing that rsnapshot is failing first, since we condition the job removal on its successful exit.
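Since the weekly crontab chains the prune scripts behind rsnapshot with &&, one hedged option (a sketch, not an agreed change; the log path is a placeholder) would be to record rsnapshot failures without letting them block the cleanup:

```sh
# Run the weekly backup, but log failures instead of aborting the cleanup;
# the prune scripts then run regardless of rsnapshot's exit status.
/opt/local/bin/rsnapshot -c /opt/local/etc/rsnapshot.conf weekly \
  || echo "$(date): rsnapshot weekly failed" >> /var/log/rsnapshot-failures.log
/root/backup_scripts/remove_old.sh ci-release.nodejs.org
/root/backup_scripts/remove_old.sh ci.nodejs.org
```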

@jbergstroem
Member Author

Just wanted to follow up here: I ran the script manually, once, which gave us about half of the space back. I will be keeping an eye on this, since the underlying issue that is stopping the script from executing needs fixing first.

@richardlau
Member

I think we may have run out of space again.

I'm seeing lots of "java.io.IOException: No space left on device" messages in the Jenkins system log (https://ci.nodejs.org/log/all).
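A quick sketch of the checks that would confirm this on the Jenkins host (mount point and job paths assumed from earlier comments):

```sh
# How full is the Jenkins data volume?
df -h /var/lib/jenkins

# Which job directories are using the most space?
du -sh /var/lib/jenkins/jobs/* 2>/dev/null | sort -h | tail -20
```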

@richardlau
Member

[screenshot]

@richardlau
Member

[screenshot]

@sxa
Member

sxa commented Aug 1, 2020

@jbergstroem resolved last night's problem - thanks!

"cleaning up now
backup host is the issue (issues with backup leading to not cleaning it)
we need to redeploy our backup host
it should be back in a few secs"

@mmarchini
Contributor

One thing I raised on IRC while this incident was ongoing is that maybe we want a couple more folks on infra admin to increase our coverage/availability.

@jbergstroem
Member Author

@mmarchini said:
One thing I raised on IRC while this incident was ongoing is that maybe we want a couple more folks on infra admin to increase our coverage/availability.

Absolutely; in parallel to this, we can increase visibility of this happening by measuring it in Grafana (which we already do; the ACL was finished just the other day) and allowing more people to monitor it (which is the next step).

@richardlau
Member

Public CI's disk was full again today. I logged into the backup host and ran the scripts in #2362 (comment). This appears to have recovered disk space, but I got errors running the scripts that probably need further looking into.

# /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

and

#  /root/backup_scripts/remove_old.sh ci.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="http://eclipse.org/jetty">Powered by Jetty:// 9.4.30.v20200611</a><hr/>

</body>
</html>
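Two hedged guesses at follow-ups, not confirmed root causes: the curl failure suggests an expired certificate somewhere in the chain (either the server certificate or the backup host's local CA bundle), and the 403 means the script's /reload request is hitting Jenkins' CSRF protection and needs a valid crumb or API token. A quick way to inspect what certificate the backup host actually sees:

```sh
# Show the validity dates and issuer of the certificate presented by the CI host,
# as seen from the backup server; helps distinguish an expired server certificate
# from an outdated local CA bundle.
echo | openssl s_client -connect ci.nodejs.org:443 -servername ci.nodejs.org 2>/dev/null \
  | openssl x509 -noout -dates -issuer
```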

cc @nodejs/build-infra

@github-actions

github-actions bot commented Sep 6, 2021

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
