🚨🚨🚨🚨 CI IS DOWN!!!! 🚨🚨🚨🚨 #825
Comments
On it.
Seems back (presumably thanks to @rvagg).
Fixed and it's up and running @nodejs/build. Disk space problem yet again, but this time I dug a bit deeper and found that it's the workspaces, not the job data, that's causing us most grief with disk. Every time a job is run, the master does the initial clone to manage the process, but then we end up with a lot of clones of some big repos, and Jenkins is pretty messy about it, making multiple workspaces, even ones with `@2`-style suffixes.

Unfortunately it's not obvious to me how we could clean these up automatically. We could schedule a cron job and delete, but we don't want to be deleting workspaces that are in use. A "last modified" check might do the trick, I suppose; I believe Jenkins doesn't keep internal state about the workspaces, they are just files on disk to be touched whenever.

Another option is to shunt this work off onto another host, a secondary, that the master uses for all of these workspaces. We have a rule in there that forces this work to be done on the master rather than some random node that's connected (which is the default). I think we could connect a secondary server with a really big disk as a slave node and have it do all of this workspace stuff, leaving the master to manage job coordination. That may have the additional side benefit of making the master more efficient and possibly faster (just a guess).
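A rough sketch of what that "last modified" cleanup could look like, assuming workspaces live under something like `/var/lib/jenkins/workspace` and that a 7-day threshold is safe; both the path and the threshold are assumptions, not settings taken from this CI:

```python
# Hypothetical cron-driven cleanup: delete workspace directories that
# haven't been touched recently, on the theory that Jenkins keeps no
# internal state about them and an in-use workspace will have a fresh mtime.
import shutil
import time
from pathlib import Path

WORKSPACE_ROOT = Path("/var/lib/jenkins/workspace")  # assumed location
MAX_AGE_DAYS = 7  # assumed threshold; tune to how long jobs can run


def stale_workspaces(root: Path, max_age_days: int):
    cutoff = time.time() - max_age_days * 86400
    for ws in root.iterdir():
        if ws.is_dir() and ws.stat().st_mtime < cutoff:
            yield ws


if __name__ == "__main__":
    for ws in stale_workspaces(WORKSPACE_ROOT, MAX_AGE_DAYS):
        print("removing", ws)
        shutil.rmtree(ws, ignore_errors=True)
```

One caveat with this approach: the mtime of a top-level workspace directory only changes when its direct children change, so a running job writing deep inside the tree won't necessarily bump it. A more careful version would check the newest mtime anywhere under each workspace before deleting.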
Disk usage went from 100% down to 15% just by deleting workspaces, FYI.
Moving git clones off the master (the second suggestion) sounds like a good idea; running Jenkins is more than enough for one machine in my experience.
Also, @nodejs/build, it may take a bit of coaxing to get all of these nodes reconnected. Some of them may not be retrying, so any help in getting them back online will be appreciated.
@rvagg a list of the ones you expect to be online would be useful, trying to run
I tried downloading a new slave.jar from https://ci.nodejs.org/computer/test-softlayer-centos6-x64-2/, but it gives this error:
Do we expect this machine to connect?
Okay, everything is back except for the Pis and these machines:
Yes, I'm pretty sure that machine was working last week; I was working on the other centos6-x64, which was offline, but this one was still fine. What JVM is it using? That error sucks because there's no clear way to fix it. Just make sure you have the slave.jar from ci.nodejs.org and an updated JVM. I'll get on soon and see what I can do if you can't make headway.

Regarding which machines should be online: all of them, unless you can't SSH in. Most of the ones not in the ARM cluster are good; I have a bunch of Pis offline though.
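For anyone reconnecting agents by hand, here's a minimal sketch of the "fresh slave.jar plus sane JVM" check described above. The `jnlpJars` download URL is the standard Jenkins location for the agent jar and is an assumption for this CI, as is the choice of Python for the helper:

```python
# Hypothetical helper: report the local JVM version and refresh slave.jar
# before reconnecting a Jenkins agent. Adjust the URL if this CI serves the
# jar elsewhere; the download may need credentials depending on security
# settings.
import re
import subprocess
import urllib.request

SLAVE_JAR_URL = "https://ci.nodejs.org/jnlpJars/slave.jar"  # assumed path


def java_version() -> str:
    # `java -version` prints to stderr, e.g. 'openjdk version "1.8.0_292"'
    out = subprocess.run(["java", "-version"], capture_output=True, text=True)
    match = re.search(r'version "([^"]+)"', out.stderr)
    return match.group(1) if match else "unknown"


def refresh_slave_jar(dest: str = "slave.jar") -> None:
    # Grab a fresh agent jar so it matches the master's Jenkins version.
    urllib.request.urlretrieve(SLAVE_JAR_URL, dest)


if __name__ == "__main__":
    print("JVM:", java_version())
    refresh_slave_jar()
    print("slave.jar refreshed; reconnect with the command shown on the "
          "node's page at ci.nodejs.org")
```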
test-mininodes-ubuntu1604-arm64_odroid_c2-1 needs a restart; it developed problems yesterday. I'm getting David @ miniNodes to deal with it.

test-digitalocean-freebsd10-x64-1 is an interesting one. I was trying to get it online on the weekend but failed; I've tried hard-rebooting it to no avail. I can get the web console open via DigitalOcean and it even responds (I can't log in there, of course), so it looks like a network problem. @jbergstroem should we just reprovision this machine? Is Ansible OK with these in its current form? I've never done a FreeBSD provision before.

Working on the centos6 machines now.
Both centos6-64 back,
Cleaned up a few more hosts too; looks like we're back on track now except for the freebsd10.
I think the aix failures in CitGM are related to the restart:
I cleaned up some old processes and restarted the Jenkins agent. A lot of the tests ran, but there are still a bunch of failures. What I can't tell is whether this is different from before, as citgm has lots of red overall. https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/nodes=aix61-ppc64/952/consoleFull
@nodejs/build - I've just made some major changes to the way CI executes
It's possible that we have some job configuration problems from this, so if things seem to be failing for wacky reasons, this could be the cause; there may be more ironing out to do. But this should take a big load off the master and take the disk pressure off there too, so we should even be able to extend the number of days we retain data from the current 5 (or 7, I don't recall what it was when I last looked).
Can anyone help?