
Jenkins CI host: manage available disk space #2362

Closed
jbergstroem opened this issue Jun 20, 2020 · 23 comments

@jbergstroem
Member

Some background here.

Seeing how a month's worth of build artifacts (our Jenkins rotation for cleaning out old jobs) now exceeds our disk size, we have two options:

  1. Reduce build history, or
  2. Expand disk space

Seeing how disk space is "cheap", I prefer the second option. I think we should create another volume, twice the size, and attach it to /var/lib/jenkins. I need to revisit our Ansible status to see how to automate this, but we could do it manually and add the automation bits afterwards.
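A minimal sketch of what the manual swap might look like, assuming the new volume shows up as /dev/vdb (the device name, filesystem and temporary paths are placeholders, and the provider-side step of actually attaching the volume isn't shown):

```sh
# Stop Jenkins so the data directory is quiescent while we copy it
systemctl stop jenkins

# Format the new volume and mount it in a temporary location
mkfs.ext4 /dev/vdb                 # placeholder device name
mkdir -p /mnt/jenkins-new
mount /dev/vdb /mnt/jenkins-new

# Copy the existing Jenkins home across, preserving permissions, ACLs and xattrs
rsync -aHAX /var/lib/jenkins/ /mnt/jenkins-new/

# Swap the mount over to /var/lib/jenkins and make it persistent
umount /mnt/jenkins-new
echo '/dev/vdb /var/lib/jenkins ext4 defaults,nofail 0 2' >> /etc/fstab
mount /var/lib/jenkins

systemctl start jenkins
```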

Thoughts?

@AshCripps
Member

I think expanding is a good shout. Out of curiosity, have we ever had any monitoring on these machines? Particularly with the workspace machines, getting a warning that the disk is going to fill up before it does and things go boom could be useful.

@jbergstroem
Member Author

@AshCripps said:
I think expanding is a good shout. Out of curiosity, have we ever had any monitoring on these machines? Particularly with the workspace machines, getting a warning that the disk is going to fill up before it does and things go boom could be useful.

I'm a bit out of the loop, but Jenkins used to warn at 10%; someone still has to keep a steady eye on the monitors.
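As a rough illustration of the kind of standalone check that warns before the disk actually fills (the 90% threshold, mount point and mail address are placeholders, not what Jenkins itself uses):

```sh
#!/bin/sh
# Warn when the Jenkins data volume crosses a usage threshold (placeholder value).
THRESHOLD=90
USED=$(df --output=pcent /var/lib/jenkins | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
  echo "WARNING: /var/lib/jenkins is ${USED}% full" \
    | mail -s "CI disk warning" build-infra@example.org
fi
```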

@mhdawson
Member

I prefer expanding versus shortening the history

@jbergstroem
Member Author

I will create a new volume and mount it for Jenkins. I'll announce a service window for this shortly, due later this week, in a TZ that fits the cause. I expect it to be relatively quick, but still.

Since we don't control VM initialization in Ansible just yet, recreating the CI with a larger base disk would be enough to cover the changes I'm making (read: no need to add more Ansible playbooks).

@rvagg
Member

rvagg commented Jun 23, 2020

@AshCripps said:
Out of curiosity, have we ever had any monitoring on these machines?

Something much discussed but never quite pulled off. There were a couple of attempts to set up a monitoring system for all our machines, but nothing ever grew to maturity. You're welcome to try, and maybe being more limited in scope would be a way to actually get something working; I think the previous attempts tried to monitor ALL THE THINGS, which might speak to their failure to eventuate.

@AshCripps
Member

I think limiting the monitoring to just the infra machines and a few other key machines, like the workspace machines, is the best bet; we should have enough redundancy in the test machines that one falling over shouldn't break us. There are plenty of open source solutions with loads of integrations (for example, we could get alerts sent to our Slack channel, which should mirror to IRC; if the warning is early enough, people can get round to it when they can rather than it being an emergency).
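As a rough sketch of the Slack end of that (the webhook URL is a placeholder; IRC mirroring would come from whatever bridge the channel already uses):

```sh
# Post a disk-space alert to a Slack channel via an incoming webhook (placeholder URL).
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"ci.nodejs.org: /var/lib/jenkins is 92% full"}' \
  https://hooks.slack.com/services/T000/B000/XXXXXXXX
```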

@jbergstroem
Member Author

Opened/scheduled an issue in #2366 to handle this.

@jbergstroem
Member Author

jbergstroem commented Jun 25, 2020

@AshCripps during my Very Active build days (read: lots of free time) I started setting up an InfluxDB/Grafana stack, since Telegraf more or less runs everywhere. I don't know what you have in mind, but perhaps we can have a chat about it and see if there are quick wins?
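For reference, a hedged sketch of the sort of Telegraf snippet this would involve, using the standard disk input and InfluxDB output plugins (the InfluxDB URL, config path and mount points are assumptions, not our actual setup):

```sh
# Drop a minimal disk-metrics config into Telegraf's include directory (assumed path)
cat <<'EOF' | sudo tee /etc/telegraf/telegraf.d/disk.conf
[[inputs.disk]]
  mount_points = ["/", "/var/lib/jenkins"]      # watch the Jenkins data volume

[[outputs.influxdb]]
  urls = ["http://influxdb.example.org:8086"]   # placeholder InfluxDB endpoint
  database = "telegraf"
EOF

# Dry-run the disk input once to confirm it produces metrics before restarting the service
telegraf --test --input-filter disk
```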

@AshCripps
Member

@jbergstroem funnily enough, that was pretty much what I had in mind: something simple and widely used/supported.

@jbergstroem
Member Author

So, a few setbacks:

  • The datacenter we are in doesn't support volume storage
  • We cannot switch DC without changing IP, which may impact the Jenkins agents contract (needs to be confirmed); there is also a lot of firewall-related configuration tied to the agents

The other options available to us are:

  • Expand the current volume to the next size (640T instead of a 320T drive), which also is not risk-free
  • Lower how long we keep job history around, from 30 days to.. something lower
  • Set up a new CI from scratch and migrate

I'd like to suggest we lower the history by perhaps just a few days as a mitigation strategy while we decide whether to extend or redeploy.
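To size that up before touching anything, a hedged sketch (the paths and the 25-day cutoff are assumptions) that only measures how much space older builds occupy, without deleting anything:

```sh
# Sum the disk usage of build directories older than 25 days (measurement only, no deletion).
find /var/lib/jenkins/jobs -maxdepth 3 -type d -path '*/builds/*' -mtime +25 \
  -exec du -sk {} + | awk '{kb += $1} END {printf "%.1f GiB\n", kb / 1024 / 1024}'
```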

@jbergstroem
Member Author

I went through a few of the jobs and it looks to me like the job pruning script either isn't being run or doesn't work as expected. I haven't been involved in this part much as of late; I will wait for some confirmation about the current state before taking any potentially destructive actions (removal of jobs, resizing of disks, ..).

@rvagg
Member

rvagg commented Jun 29, 2020

@jbergstroem /var/spool/cron/crontabs/root does it weekly: /opt/local/bin/rsnapshot -c /opt/local/etc/rsnapshot.conf weekly && /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org

No guarantee that it's working properly of course! Might be worth running it manually and watching that it's doing the right thing? The regex might need updating too. 🤷
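One low-risk way to watch it would be to run the script under shell tracing (a generic suggestion, assuming the script is plain shell), so every command it issues is echoed as it runs:

```sh
# Trace the prune script for one host; -x prints each command as it executes,
# and tee keeps a copy of the output for later inspection.
bash -x /root/backup_scripts/remove_old.sh ci.nodejs.org 2>&1 | tee /tmp/remove_old.trace
```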

@rvagg
Member

rvagg commented Jun 29, 2020

that's on our backup server btw, infra-joyent-smartos15-x64-1

@jbergstroem
Member Author

jbergstroem commented Jun 29, 2020

@rvagg said:
that's on our backup server btw, infra-joyent-smartos15-x64-1

Ah, I was looking in the wrong place. The file (remove_old.sh) has local modifications; I'll have to read up on the changes.

So, the remove job itself seems to work fine (I changed the local 7-day setting to 21 for now). I'm guessing that rsnapshot is failing first, since we condition the job removal on its successful exit.
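Since the weekly crontab chains the prune scripts behind rsnapshot with &&, one hedged option (a sketch, not an agreed change; the log path is a placeholder) would be to record rsnapshot failures without letting them block the cleanup:

```sh
# Run the weekly backup, but log failures instead of aborting the cleanup;
# the prune scripts then run regardless of rsnapshot's exit status.
/opt/local/bin/rsnapshot -c /opt/local/etc/rsnapshot.conf weekly \
  || echo "$(date): rsnapshot weekly failed" >> /var/log/rsnapshot-failures.log
/root/backup_scripts/remove_old.sh ci-release.nodejs.org
/root/backup_scripts/remove_old.sh ci.nodejs.org
```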

@jbergstroem
Member Author

Just wanted to follow up here: I ran the script manually, once, which gave us about half of the space back. I will be keeping an eye on this, since the underlying issue that is stopping the script from executing needs fixing first.

@richardlau
Member

I think we may have run out of space again.

I'm seeing lots of "java.io.IOException: No space left on device" messages in the Jenkins system log (https://ci.nodejs.org/log/all).
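A quick sketch of the checks that would confirm this on the Jenkins host (mount point and job paths assumed from earlier comments):

```sh
# How full is the Jenkins data volume?
df -h /var/lib/jenkins

# Which job directories are using the most space?
du -sh /var/lib/jenkins/jobs/* 2>/dev/null | sort -h | tail -20
```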

@richardlau
Member

[screenshot]

@richardlau
Member

[screenshot]

@sxa
Member

sxa commented Aug 1, 2020

@jbergstroem resolved last night's problem - thanks!

"cleaning up now
backup host is the issue (issues with backup leading to not cleaning it)
we need to redeploy our backup host
it should be back in a few secs"

@mmarchini
Contributor

One thing I raised on IRC while this incident was ongoing is that maybe we want a couple more folks on infra admin to increase our coverage/availability.

@jbergstroem
Member Author

@mmarchini said:
One thing I raised on IRC while this incident was ongoing is that maybe we want a couple more folks on infra admin to increase our coverage/availability.

Absolutely; in parallel to this, we can increase visibility of this happening by measuring it in Grafana (which we already do; the ACL was finished just the other day) and allowing more people to monitor it (which is the next step).

@richardlau
Member

Public CI's disk was full again today. I logged into the backup host and ran the scripts in #2362 (comment). This appears to have recovered disk space, but I got errors running the scripts that probably need further looking into.

# /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

and

#  /root/backup_scripts/remove_old.sh ci.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="http://eclipse.org/jetty">Powered by Jetty:// 9.4.30.v20200611</a><hr/>

</body>
</html>
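Two hedged guesses at follow-ups, not confirmed root causes: the curl failure suggests an expired certificate somewhere in the chain (either the server certificate or the backup host's local CA bundle), and the 403 means the script's /reload request is hitting Jenkins' CSRF protection and needs a valid crumb or API token. A quick way to inspect what certificate the backup host actually sees:

```sh
# Show the validity dates and issuer of the certificate presented by the CI host,
# as seen from the backup server; helps distinguish an expired server certificate
# from an outdated local CA bundle.
echo | openssl s_client -connect ci.nodejs.org:443 -servername ci.nodejs.org 2>/dev/null \
  | openssl x509 -noout -dates -issuer
```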

cc @nodejs/build-infra

@github-actions

github-actions bot commented Sep 6, 2021

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
