Pi3 troubles #2365
I'm not sure if this has been raised elsewhere (I couldn't spot a recent build issue for it), but it seems we're down to 2 Pis @ https://ci.nodejs.org/label/pi3-raspbian-stretch/
I mentioned it in the build Slack channel yesterday. I wasn't able to SSH into any of the offline Pi 3's.
I'm on it. The Pi 3's have been very unstable and I haven't been able to make much of a dent on their stability. I'll try another update today to see if something new comes in to solve this. In the meantime, if this bottleneck becomes chronic, disable the arm-fanned job entirely to remove it from consideration; it's not the biggest loss and I can usually get on top of it within 24 hours if notified.
It seems like a number of Pis have recently dropped off (also reported in Slack by @Trott).
I think it's starting to backlog (the trend is showing a few 4h+ builds). Would someone be able to disable the arm-fanned job temporarily?
I've disabled it. cc @nodejs/build-infra
Copying over from Rod's comment in #2510 (comment):
https://ci.nodejs.org/label/pi3-docker/ 3 are already offline, sigh. I'll do my best to keep them alive but I might need to be reminded now and again.
I disabled the arm-fanned job yesterday. https://node-js.slack.com/archives/C16SCB5JQ/p1610575369213600
I opened #2523 yesterday about that. |
I've re-enabled arm-fanned; it's working and I got greens on v15.x, but I think master might be in a poor state as far as the Pi's are concerned. I don't know if things have slipped since it's been disabled or if we just have a large build-up of flaky tests, but it's not looking so awesome. Feel free to disable again if there's something that's genuinely a problem, like holding up a release or consistently taking >1h to complete. Otherwise it should be treated as a valid test environment that needs to be taken care of when things break.
btw: I think it would be nice if it were disabled @ https://ci.nodejs.org/job/node-test-commit-arm-fanned/ rather than within node-test-commit itself. It's a bit tucked away in there, but there are nice big "Disable" and "Enable" buttons in the job itself and it does the same thing.
Linking #2583. |
I put this in Slack/IRC too: Down to a single pi3 host on Jenkins again. https://ci.nodejs.org/label/pi3-raspbian-stretch/ This is an annoying-but-tolerable bottleneck on the weekend. It's going to be pretty bad come Monday and is probably not great for the security release that's supposed to go out on Tuesday. I'll see if I can find out what to do from the build repo beyond disabling Pi testing altogether.
Brought a bunch back online; a few still won't come up. I'm going to try and find some time to play with getting these upgraded to the newer Raspbian (now "Raspberry Pi OS"). I think this is a kernel issue that isn't being fixed because the OS is too old.
@rvagg It looks like all Pi devices are failing everything now. NFS problem? I'm seeing things like this:
...and this:
Yeah, NFS server crash; getting it all unblocked now @Trott
It's a Pi 2 and not a Pi 3, but https://ci.nodejs.org/computer/test-requireio_rvagg-debian10-armv7l_pi2-1/ has been build-failing consistently so I marked it offline. As far as I can tell, everything else is in working order at the moment.
I've restarted the agent on https://ci.nodejs.org/computer/test-requireio_rvagg-debian10-armv7l_pi2-1/.
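For context, restarting an agent by hand usually comes down to bouncing whatever service keeps it connected to Jenkins. A minimal sketch, assuming the agent runs as a systemd unit; the unit name and the SSH user/host placeholders below are assumptions, not the actual setup:

```bash
# Hypothetical: restart a Jenkins agent that runs as a systemd service.
# "jenkins" as the unit name and "<user>@<pi-host>" are placeholders; check
# the real unit with `systemctl list-units | grep -i jenkins` first.
ssh <user>@<pi-host> 'sudo systemctl restart jenkins && sudo systemctl status jenkins --no-pager'
```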
Looks like we've been down to just two working/online Raspberry Pi 3 devices for the past 24 hours or so. https://ci.nodejs.org/label/pi3-raspbian-stretch/
My big firmware and in-place upgrades didn't do the trick.
Back to two Pi 3's again and those two are failing to remove a file (NFS):
https://ci.nodejs.org/job/node-test-binary-arm-12+/10387/RUN_SUBSET=0,label=pi3-docker/console |
I removed the workspace on the two working Pi 3's 🤞.
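Removing the workspace by hand is just deleting the stale job directory under the agent's workspace root while the node is marked offline. A minimal sketch; the workspace root path and the exact job directory name are assumptions here (the job name is taken from the node-test-binary-arm console link above):

```bash
# Hypothetical workspace root; the real path depends on how the agent is configured.
WORKSPACE_ROOT=/home/<agent-user>/build/workspace

# Mark the node offline in Jenkins first so a build doesn't start mid-cleanup,
# then remove the stale job workspace(s).
rm -rf "${WORKSPACE_ROOT}"/node-test-binary-arm*
```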
We have more back online but they probably won't last!
FWIW (this is a Pi 2 so not sure if it belongs here) I noticed that test-requireio_rvagg-debian10-armv7l_pi2-1 was marked offline for "repeated build failures". I logged into the machine and saw lots of running
(looks related to pummel?). I've cleaned up the processes and brought the host back online.
I've added
after the existing
to kill any running node processes as well as any defunct zombies. This should hopefully reduce the number of build failures when
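The exact lines added to the job aren't shown above, but a minimal sketch of that kind of cleanup step might look like the following; the process name match and the zombie handling are assumptions, not the actual script:

```bash
# Kill any node processes left over from a previous (hung or aborted) run.
pkill -9 -x node || true

# Defunct (zombie) processes can't be killed directly; they only go away once
# their parent reaps them or exits, so signal the parents of any zombies instead.
ps -eo pid=,ppid=,stat= | awk '$3 ~ /^Z/ { print $2 }' | sort -u | xargs -r kill || true
```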
It appears that we're down to 3 online Pi 3 devices again. I wonder if it's the same 3 every time. That does seem to be the magic number.
I think it might be. For the record, 1, 3 and 6 (according to my numbering, which is different to the public naming... sorry) are still online. I remember the last time (last week?) I did this, 1 and 3 were among the ones still alive (can't remember which one was the third), and 3 was the one I reprovisioned from scratch recently. Will give the rest a bump right now.
Status update in #2661; I think we can close this but I'll continue to monitor the situation.
Since I redid the Pi boot disk arrangement a few weeks ago, the Pi 3's have been having problems staying online. The Pi 1's and 2's seem to be just fine: all of them are online and none seem to have needed fixing. The 1's are used less, but the 2's should be used as much as the 3's. I think the `git-update-nodesource` job might be pinned to the 3's, but that should be the main difference. Yet twice now I've had to deal with more than half of them being unresponsive. Just now, all but 1, 6 and 9 (ordered by the numbers they appear in inventory) were offline. I've started them all up again and am doing an update on just the 3's in case there's something in there that helps out.

I'm suspecting OS problems that are impacting just this generation of device. It could also be related to Docker, which is a continual source of trouble on ARM. As yet I have no evidence of what might be locking them up and am just making guesses.
Please help keep an eye on them, I might need reminding if too many disappear. It'll also be interesting to note which ones are impacted, there may be a pattern.