Jenkins CI host: manage available disk space #2362
I think expanding is a good shout. Out of curiosity, have we ever had any monitoring on these machines? Particularly with the workspace machines, getting a warning that the disk is going to fill up before it does and things go boom could be useful.
I'm a bit out of the loop, but Jenkins used to warn at 10%; someone still has to keep a steady eye on the monitors.
I prefer expanding versus shortening the history.
I will create a new volume and mount it for Jenkins. Will announce a service period for this shortly, due later this week in a TZ that fits the cause. I expect it to be relatively quick, but still. Since we don't control VM initialization in ansible just yet, recreating CI with a larger disk base would be enough to cover the changes I'm doing (read: no need to add more ansible playbooks).
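For reference, a minimal sketch of the manual steps involved, assuming the new volume shows up as /dev/vdb; the device name, filesystem and paths here are placeholders, not the actual provisioning details:

```
# Format the new volume and move the existing Jenkins data onto it.
sudo mkfs.ext4 /dev/vdb                    # hypothetical device name
sudo systemctl stop jenkins                # quiesce the data directory first
sudo mkdir -p /mnt/jenkins-new
sudo mount /dev/vdb /mnt/jenkins-new
sudo rsync -aHAX /var/lib/jenkins/ /mnt/jenkins-new/
sudo umount /mnt/jenkins-new
echo '/dev/vdb /var/lib/jenkins ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
sudo mount /var/lib/jenkins                # new volume now backs the Jenkins home
sudo systemctl start jenkins
```

The downtime is essentially the rsync copy, which is why a short announced service window should be enough.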
Something much discussed but never quite pulled off. There were a couple of attempts to set up a monitoring system for all our machines, but nothing ever grew to maturity. You're welcome to try, and maybe being more limited in scope would be a way to actually get something working. I think previously the attempts have been to monitor ALL THE THINGS, which might speak to their failure to eventuate.
I think limiting the monitoring to just the infra machines and a few other key machines like the workspace machines is the best bet; we should have enough redundancy in the test machines that one falling over shouldn't break us. There are plenty of open source solutions with loads of integrations (for example, we could get alerts sent to our Slack channel, which should mirror to IRC, and if the warning is early enough people can get round to it when they can rather than it being an emergency).
Opened/scheduled an issue in #2366 to handle this.
@AshCripps during my Very Active build days (read: lots of free time) I started setting up an InfluxDB/Grafana stack, since telegraf more or less runs everywhere. I don't know what you have in mind, but perhaps we can have a chat about it and see if there are quick wins?
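In the meantime, a cron-driven check is about the simplest stopgap for the early-warning idea above. A minimal sketch, assuming a Slack incoming webhook and a 90% threshold (both hypothetical values, not existing infra config):

```
#!/usr/bin/env bash
# Post to Slack when the Jenkins data disk crosses a usage threshold.
# SLACK_WEBHOOK and THRESHOLD are placeholders, not real infra settings.
THRESHOLD=90
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"

usage=$(df --output=pcent /var/lib/jenkins | tail -1 | tr -dc '0-9')
if [ "$usage" -ge "$THRESHOLD" ]; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"ci.nodejs.org: /var/lib/jenkins is at ${usage}% capacity\"}" \
    "$SLACK_WEBHOOK"
fi
```

A telegraf disk input feeding InfluxDB/Grafana would replace something like this once the stack is in place.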
@jbergstroem funnily enough, that was pretty much what I had in mind as something simple and widely used/supported.
So, a few setbacks:
The other options available to us are:
I'd like to suggest we shorten the history to perhaps just a few days as a mitigation strategy while deciding whether we should extend or redeploy.
I went through a few of the jobs and it looks to me like the job pruning script either isn't run or doesn't work as expected. I haven't been involved in this part much as of late; I will wait for some confirmation about the current state before doing any potentially destructive actions (removing jobs, resizing disks, …).
@jbergstroem /var/spool/cron/crontabs/root does it weekly. No guarantee that it's working properly, of course! Might be worth running it manually and watching that it's doing the right thing? The regex might need updating too. 🤷
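For a manual check without the destructive part, something along these lines would do; the path under the Jenkins home and the 7-day window are assumptions, since the actual script isn't quoted here:

```
# Hypothetical dry run: list build directories older than the retention window
# without deleting anything.
find /var/lib/jenkins/jobs/*/builds -mindepth 1 -maxdepth 1 -type d -mtime +7 -print
# Once the listing looks sane, the destructive equivalent would end in
# ... -print0 | xargs -0 rm -rf
```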
That's on our backup server btw, infra-joyent-smartos15-x64-1.
Ah, was looking in the wrong place. The file (…). So, the remove job seems to work fine (I changed the local 7-day retention to 21 for now); I'm guessing rsnapshot fails before that point, then, since we condition the job removal on a successful exit.
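Based on that description, the crontab entry presumably chains the two steps roughly like this; the rsnapshot interval and the script path are hypothetical, since the actual entry isn't quoted in this thread:

```
# Illustrative only: the prune step runs only if the rsnapshot backup exits 0,
# so a failing backup silently skips the job removal (the symptom seen above).
rsnapshot weekly && /root/prune-old-jenkins-jobs.sh
```

That design is deliberate (don't delete anything that hasn't been backed up), but it does mean a broken backup and a full disk tend to arrive together.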
Just wanted to follow up here: I ran the script manually, once, giving us about half of the space back. I will be keeping an eye on this, since the underlying issue (whatever is making the script not execute) needs fixing first.
I think we may have run out of space again. I'm seeing lots of "java.io.IOException: No space left on device" messages in the Jenkins system log (https://ci.nodejs.org/log/all).
@jbergstroem resolved last night's problem - thanks!
One thing I raised on IRC while this incident was ongoing is that maybe we want a couple more folks on infra admin to increase our coverage/availability.
Absolutely; in parallel to this we can increase visibility of this happening by measuring it in Grafana (which we already do; the ACL was finished just the other day) and allowing more people to monitor it (which is the next step).
Public CI's disk was full again today. I logged into … and …
cc @nodejs/build-infra
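When this happens, the usual triage on the master is something like the following; this is a generic sketch, not a record of what was actually run:

```
# Confirm which filesystem is full, then find the largest job directories
# under the Jenkins home.
df -h /var/lib/jenkins
sudo du -xsh /var/lib/jenkins/jobs/* 2>/dev/null | sort -rh | head -20
```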
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
Some background here.
Seeing how a month's worth (our Jenkins rotation for cleaning old jobs) of build artifacts now exceeds our disk size, we have two options: shorten the history we keep, or expand the available disk.
Seeing how disk space is "cheap", I prefer the second option. I think we should create another volume, twice the size, and attach it to /var/lib/jenkins. I need to revisit our ansible status to see how to automate this, but we could do it manually and post-add the automation bits. Thoughts?