slow remote e2e tests #15272
I noticed a lot of yellow (SLOW TEST warnings) in the remote logs, so I counted them:

$ cirrus-pr-timing --download 15275
...
$ cd cirrus-pr-timing.5394951126122496
$ grep -c 'SLOW TEST' int-*-root-*        (formatted for legibility)
int-podman-fedora-36-root-container : 128
int-podman-fedora-36-root-host      : 149
int-podman-ubuntu-2204-root-host    : 185
int-remote-fedora-36-root-host      : 509   <---
int-remote-ubuntu-2204-root-host    : 506   <---

Obvious next step is trying to reproduce on my laptop... but no luck:

$ for i in "" "-remote"; do time sudo ./ginkgo $i "pause a checkpointed container by id" >/dev/null; done
sudo ./ginkgo $i "pause a checkpointed container by id" > /dev/null  7.08s user 1.67s system 79% cpu 11.016 total
sudo ./ginkgo $i "pause a checkpointed container by id" > /dev/null  7.01s user 1.67s system 85% cpu 10.173 total

[this is my ginkgo script]. Most important is its setting of ... Can't think of anything else to try right now.
Slightly better luck with `rmi with cached images`, which seems to be the slowest of the slow (111s in CI):

$ for i in "" "-remote"; do time sudo ./ginkgo $i "rmi with cached images" >/dev/null; done
sudo ./ginkgo $i "rmi with cached images" > /dev/null  7.88s user 2.50s system 110% cpu 9.375 total
sudo ./ginkgo $i "rmi with cached images" > /dev/null  13.31s user 3.97s system 119% cpu 14.485 total

Takes almost double the time, but still O(10s), not O(100s). When I run it manually the test fails (not gonna even look at why), but the slow steps are the ... Anyhow, FWIW.
Thanks for the valuable insights, @edsantiago!
A friendly reminder that this issue had no activity for 30 days.
Is this moving forward?
@baude, did you find time to look further into this?
Hi, just checking in. I ran another timing pass:
A friendly reminder that this issue had no activity for 30 days.
Here's my periodic cirrus-pr-timing run:

Dramatic improvement in the remote tests. Does anyone have a clue why? Am I going to need to bisect to figure out when things changed?
@edsantiago a lot of performance improvements went in recently. Aren't we spinning up a new ...?
Good thinking... but that's not it.
Subsequent PRs with that as merge-base have similar timings. This strongly suggests that the problem was in the VM images, not in podman itself. This is good (yay podman) and bad (we need to be SUPER CAREFUL now when switching VM images, and we probably need to spend some time understanding just what the heck the difference is).
Because there's talk of bumping up VM images, here's a sample timing run from today (PR #17056):
I think we can close this issue and open a new one, "slow local e2e tests" 🤣
Because there's no other place to keep track of PR timings, here's cirrus-pr-timing for 17589, a PR run with the latest CI VMs based on 17503 (the Debian one):
...looks pretty much the same to me, but I just eyeballed it; I did not script-diff.
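(For the record, a "script-diff" here could be as simple as the sketch below; the two filenames are placeholders for saved timing tables from the two runs being compared.)

$ # compare two saved timing tables, ignoring row order
$ diff <(sort timings-17056.txt) <(sort timings-17589.txt)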
Continuing the tradition of using this issue as a central holding pen for PR timing results, here's a table that now includes SQLite results (PR #17855):
I am personally happy with the current state of affairs. When I created the issue, the delta was really big, but it has improved significantly since then. @edsantiago, out of curiosity: do we have an average/median time to run one test? Some tests are skipped for remote and some for local, so I am curious what the average times may look like.
I'm not sure I understand your question. Like, do I gather statistics on all PRs and aggregate them? No, sorry. It's on my fantasy wishlist, but much too tricky to spend serious time on. Your "1 test" confuses me, though; it seems to suggest something like averaging between root/rootless/remote and different distros?
What I meant is time per test: ...
Eeeh, the other way around, for sure.
I still don't understand, possibly because "test" is such an overloaded term. Do you mean "number of individual ..."? If that's not what you mean, could you give me an example? Sorry to be so dense today.
Yes, that's what I am interested in. But I see this as a nice-to-have and do not think it's critical at all.
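(For reference, a minimal sketch of how such a per-test number could be pulled from a single e2e log, assuming the log is plain text and ends with the classic Ginkgo summary line "Ran N of M Specs in T seconds"; the log name is just one of the files from the listings above, and this is not something CI computes today.)

$ # average seconds per spec = total suite time / number of specs actually run
$ awk '/^Ran [0-9]+ of [0-9]+ Specs in/ { printf "%.2f s per spec (%d specs, %.0f s total)\n", $7/$2, $2, $7 }' int-remote-fedora-36-root-host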
We've got a problem on Debian: remote e2e tests are consistently taking 49m to run (compared to 39-42m for all the Fedoras), which means that every so often they exceed 50m and get killed by the Cirrus timeout. I will not have time to investigate this until next week.
Thanks for sharing, @edsantiago. Smells like a deadlock. Cc: @mheon
I'm looking for PRs as examples of ...
The sqlite is because all of those are #17831, my hammer-away PR; debian+sqlite is not a real-world case. Sorry for the confusion. Here's a 46-minute run, from my last test run yesterday. This has nothing to do with sqlite, though: debian-remote is in the 46-to-49-minute range (super slow) on all CI runs.
Looking at the test timings in the linked run: there are a few long tests, but not an unreasonable number, and even the longest is only a bit over a minute. It feels like every test being run is just slow; we seem to have hundreds of tests taking 2-3 seconds each, for example. So this is a systemic issue, not individual tests running really poorly on Debian. Given this, I really doubt we're looking at deadlocks. I wonder if it's something to do with ...
I agree. Doesn't look like a deadlock or a sqlite fart. Need to track down what's different with the Debian images. I recall that Go on Ubuntu was once responsible for that.
I'm going to close this, as I think we have, in spirit, addressed the original complaint. Specific issues should now be filed, and we can fix them.
Thank you for noticing this. Yes, it can be closed; we now have parity between remote & local e2e tests:

(source: #19887, a recently-closed PR)
Great work, everyone!
The remote e2e tests are noticeably slower than the local ones (see table below), in some cases by up to a factor of two. Speeding up the e2e tests would tighten the CI feedback loop and probably reduce CI costs.
For a moment I thought that the remote tests would load the cache via the remote client, but as it turned out in #15266, the remote tests are already loading the local tar files via the local client.
Assigning to @baude, who expressed interest in taking a closer look.