slow remote e2e tests #15272

Closed
vrothberg opened this issue Aug 10, 2022 · 30 comments

@vrothberg
Member

The remote e2e tests are noticeably slower than the local ones (see table below), in some cases by up to a factor of two. Speeding up the e2e tests would shorten the CI feedback loop and probably reduce CI costs.

type distro user local remote container
int fedora-36 root 30:01 48:17 29:23
int ubuntu-2204 root 30:27 52:12

For a moment I thought that the remote tests would load the cache via the remote client, but as it turned out in #15266, the remote tests are already loading the local tar files via the local client.
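
To illustrate the distinction (a rough sketch; the tarball path is made up for the example), loading via the local client reads the tarball straight into the image store, while loading via the remote client would first stream it over the API socket:

$ podman load -i /var/tmp/cache/alpine.tar          # local client: reads the tar directly from disk
$ podman-remote load -i /var/tmp/cache/alpine.tar   # remote client: streams it over the API socket first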

Assigning to @baude who raised interest in taking a closer look.

@edsantiago
Member

I noticed a lot of yellow SLOW TEST warnings while poking around on one of my PRs today, and:

$ cirrus-pr-timing --download 15275
...
$ cd cirrus-pr-timing.5394951126122496
$ grep -c 'SLOW TEST' int-*-root-*    (formatted for legibility)
int-podman-fedora-36-root-container  : 128
int-podman-fedora-36-root-host       : 149
int-podman-ubuntu-2204-root-host     : 185
int-remote-fedora-36-root-host       : 509   <---
int-remote-ubuntu-2204-root-host     : 506   <---

Obvious next step is trying to reproduce on my laptop... but no luck:

$ for i in "" "-remote"; do time sudo ./ginkgo $i "pause a checkpointed container by id" >/dev/null;done
sudo ./ginkgo $i "pause a checkpointed container by id" > /dev/null  7.08s user 1.67s system 79% cpu 11.016 total
sudo ./ginkgo $i "pause a checkpointed container by id" > /dev/null  7.01s user 1.67s system 85% cpu 10.173 total

[this is my ginkgo script]. The most important part is its setting of TMPDIR, which populates the cached images on first run. I use time rather than relying on ginkgo's own timing, which could obscure internal bottlenecks.
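
In case it helps anyone reproduce, here is a minimal sketch of what such a wrapper could look like (this is not the actual script; the TMPDIR path and the use of the "remote" build tag are assumptions):

#!/bin/bash
# Hypothetical ./ginkgo wrapper: pin TMPDIR so the image cache populated on
# the first run is reused, then run one e2e test by description, optionally
# against podman-remote.
set -euo pipefail

export TMPDIR=/var/tmp/podman-e2e-tmp    # assumption: any persistent directory works
mkdir -p "$TMPDIR"

tags=""
if [ "${1:-}" = "-remote" ]; then
    tags="remote"                        # assumption: remote e2e builds with the 'remote' tag
    shift
fi

cd test/e2e
# --focus limits the run to It() descriptions matching the argument
exec ginkgo --tags "$tags" --focus "$*" .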

Can't think of anything else to try right now.

@edsantiago
Member

Slightly better luck with `rmi with cached images`, which seems to be the slowest of the slow (111s in CI):

$ for i in "" "-remote"; do time sudo ./ginkgo $i "rmi with cached images" >/dev/null;done
sudo ./ginkgo $i "rmi with cached images" > /dev/null   7.88s user 2.50s system 110% cpu  9.375 total
sudo ./ginkgo $i "rmi with cached images" > /dev/null  13.31s user 3.97s system 119% cpu 14.485 total

Takes almost double the time, but still O(10s) not O(100s).

When I run it manually the test fails (not gonna even look at why), but the slow steps are the podman build ones. That's easy to spot interactively because the unbuffered output stalls.

Anyhow, FWIW.

@vrothberg
Member Author

Thanks for the valuable insights, @edsantiago !

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 12, 2022

Is this moving forward?

@vrothberg
Member Author

@baude, did you find time to look further into this?

@edsantiago
Member

Hi, just checking in. I ran cirrus-pr-timing on #15896, one of my PRs today, a simple one:

type distro user local remote container
int fedora-36 root 31:06 45:41 26:12
int ubuntu-2204 root 30:48 34:11
int fedora-36 rootless 28:52
int ubuntu-2204 rootless 26:15
sys fedora-36 root 30:16 27:47
sys fedora-36-aarch64 root 33:22 22:57
sys ubuntu-2204 root 26:31 15:44
sys fedora-36 rootless 29:51 27:10
sys ubuntu-2204 rootless 26:27
bud fedora-36 root 23:48 24:03

fedora int is still slow, but ubuntu int is now much closer between local and remote. Remote is still slower, but not preposterously so.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member

Here's my periodic cirrus-pr-timing checkin, this time on #16284:

type distro user local remote container
int fedora-36 root 26:54 29:30 25:43
int ubuntu-2204 root 25:37 31:48
int fedora-36 rootless 24:15
int ubuntu-2204 rootless 27:57
sys fedora-36 root 23:08 16:10
sys fedora-36-aarch64 root 26:15 17:45
sys ubuntu-2204 root 19:13 14:22
sys fedora-36 rootless 21:08 13:16
sys ubuntu-2204 rootless 19:35
bud fedora-36 root 24:26 25:58

Dramatic improvement in remote tests. Does anyone have a clue why? Am I going to need to bisect to figure out when things changed?

@vrothberg
Member Author

@edsantiago a lot of performance improvements went in recently. Aren't we spinning up a new system service for each test? If so, I am pretty sure that the shutdown improvements add up.

@edsantiago
Member

Good thinking... but that's not it. cirrus-pr-timing is almost instantaneous, so I decided to just bisect down the git history. And the evidence suggests that the winner is #15826 which brought in new VM images:

type distro user local remote container
int fedora-36 root 27:12 30:29 24:27
int ubuntu-2204 root 27:55 33:36
int fedora-36 rootless 27:26
int ubuntu-2204 rootless 25:23
sys fedora-36 root 23:46 16:18
sys fedora-36-aarch64 root 26:44 18:00
sys ubuntu-2204 root 20:05 14:55
sys fedora-36 rootless 22:04 13:41
sys ubuntu-2204 rootless 20:01
bud fedora-36 root 24:00 24:35

Subsequent PRs with that as merge-base have similar timings. This strongly suggests that the problem was in the VM images, not in podman itself. This is good (yay podman) and bad (we need to be SUPER CAREFUL now when switching VM images) (and, we probably need to spend some time to understand just what the heck the difference is).

@edsantiago
Member

Because there's talk of bumping up VM images, here's a sample timing run from today (PR #17056):

type distro user local remote container
int fedora-36 root 29:15 30:44 28:32
int fedora-37 root 28:28 31:09 26:04
int fedora-36 rootless 25:29
int fedora-37 rootless 25:42
sys fedora-36 root 28:19 18:06
sys fedora-37 root 26:07 18:07
sys fedora-37-aarch64 root 29:01 19:35
sys fedora-36 rootless 24:36
sys fedora-37 rootless 24:48 15:43
bud fedora-37 root 26:48 26:33

@vrothberg
Member Author

I think we can close the issue and open a new one "slow local e2e tests" 🤣

@edsantiago
Member

Because there's no other place to keep track of PR timings, here's cirrus-pr-timing for 17589, a PR run with the latest CI VMs based on 17503 (the Debian one):

type distro user local remote container
int debian-12 root 26:31 34:34
int fedora-36 root 30:58 36:01 29:53
int fedora-37 root 29:39 33:02 28:16
int debian-12 rootless 23:34
int fedora-36 rootless 27:57
int fedora-37 rootless 29:10
sys debian-12 root 24:53 18:17
sys fedora-36 root 29:32 19:48
sys fedora-37 root 28:13 18:47
sys fedora-37-aarch64 root 32:16 21:06
sys debian-12 rootless 25:01
sys fedora-36 rootless 26:49
sys fedora-37 rootless 26:42 16:34
bud fedora-37 root 28:05 28:10

...looks pretty much the same to me, but I just eyeballed, did not script-diff.

@edsantiago
Member

Continuing the tradition of using this issue as a central holding pen for PR timing results, here's a table that now includes SQLite results (pr 17855):

type distro user DB local remote container
int debian-12 root 27:02 34:58
int fedora-36 root 33:06 37:01 32:47
int fedora-37 root 29:00 35:01 27:22
int fedora-37 root sqlite 28:42 34:06
int debian-12 rootless 26:08
int fedora-36 rootless 29:15
int fedora-37 rootless 27:52
int fedora-37 rootless sqlite 28:04
sys debian-12 root 25:35 17:40
sys fedora-36 root 30:23 20:44
sys fedora-37 root 28:48 19:12
sys fedora-37-aarch64 root 32:56 21:57
sys debian-12 rootless 27:55
sys fedora-36 rootless 29:25
sys fedora-37 rootless 27:57 18:37
bud fedora-37 root 30:33 27:29

@vrothberg
Member Author

I am personally happy with the current state of affairs. When I created the issue, the delta was really big, but it has improved significantly since then.

@edsantiago, out of curiosity. Do we have average/median time to run 1 test? Some tests are skipped for remote and some for local, so I am curious what the average times may look like.

@edsantiago
Member

Do we have average/median time to run 1 test?

I'm not sure I understand your question. Like, do I gather statistics on all PRs and aggregate them? No, sorry. On my fantasy wishlist but much too tricky to spend serious time on.

Your "1 test" confuses me though, it seems to suggest something like averaging between root/rootless/remote and different distros?

@vrothberg
Member Author

What I meant is time per test: $numberOfExecutedTests divided by $totalTime could be a starting point.

@vrothberg
Member Author

Eeeh, the other way around, of course: time/tests.
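
As a rough starting point (assuming the downloaded Cirrus logs still contain Ginkgo's "Ran N of M Specs in X seconds" summary line), something like this could compute it:

$ awk '/Ran [0-9]+ of [0-9]+ Specs in/ {
        ran  += $2          # specs actually executed (skipped ones excluded)
        secs += $(NF-1)     # wall-clock seconds reported by ginkgo
      }
      END { if (ran) printf "%d specs, %.0fs total, %.1fs per spec\n", ran, secs, secs/ran }' int-*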

@edsantiago
Member

I still don't understand, possibly because "test" is such an overloaded term. Do you mean "number of individual It() (e2e) or @test (bats) tests in one specific Cirrus CI task", such as, "it took 30:00 (1800s) to run 250 tests, therefore, each test ran in a mean time of 7.2 seconds"? That's kind of hard to calculate (requires downloading full logs for each test), and I'm not sure it's meaningful given the huge variance in individual tests.

If that's not what you mean, could you give me an example? Sorry to be so dense today.

@vrothberg
Member Author

That's kind of hard to calculate (requires downloading full logs for each test), and I'm not sure it's meaningful given the huge variance in individual tests.

Yes, that's what I am interested in. But I see this as nice to have and do not think at all it's critical.

@edsantiago
Member

We've got a problem on Debian. Remote e2e tests are consistently taking 49m to run (compared to 39-42m for all Fedoras), which means that every so often they exceed 50m and get killed by the Cirrus timeout. I will not have time to investigate this until next week.

  • debian-12 : int remote debian-12 root host sqlite [remote]
    • 06-15 09:08 in int remote debian-12 root host sqlite: Timeout
    • 06-14 18:59 in int remote debian-12 root host sqlite: Timeout
    • 06-14 11:38 in int remote debian-12 root host sqlite: Timeout
    • 05-27 16:58 in int remote debian-12 root host sqlite: Timeout
    • 05-26 10:51 in int remote debian-12 root host sqlite: Timeout

@vrothberg
Member Author

Thanks for sharing, @edsantiago.

Smells like a deadlock.

Cc: @mheon

@mheon
Member

mheon commented Jun 16, 2023

I'm looking for examples in PRs of int remote debian-12 root host sqlite where it hasn't failed (but has almost done so), but I can't actually find it running on any of our PRs. We have an int remote debian-12 root host boltdb task but seemingly no sqlite version?

@edsantiago
Member

The sqlite is there because all of those runs come from #17831, my hammer-away PR; debian+sqlite is not a real-world case. Sorry for the confusion. Here's a 46-minute run, from my last test run yesterday.

This has nothing to do with sqlite, though: debian-remote is in the 46-to-49-minute range (super slow) on all CI runs.

@mheon
Member

mheon commented Jun 16, 2023

Looking at the test timings in the linked run - there are a few long tests, but not an unreasonable number, and even the longest is only a bit over a minute. It feels like every test being run is just slow: we seem to have hundreds of tests taking 2-3 seconds each, for example. So this is a systemic issue, not individual tests running really poorly on Debian.

Given this, I really doubt we're looking at deadlocks. I wonder if it's something to do with podman system service - maybe it's particularly slow to start on Debian and adds a lot of overhead?
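
One quick way to test that theory (just a sketch; the socket path is arbitrary and the polling loop is only illustrative) would be to time how long the service needs before it answers an API call:

#!/bin/bash
# Rough sketch: measure how long "podman system service" takes before the
# API socket actually answers, which approximates the per-test startup
# overhead if every e2e test spins up its own service.
sock=unix:///tmp/podman-timing.sock     # arbitrary path for this experiment

start=$(date +%s%N)
podman system service --time=0 "$sock" &
svc=$!

# Poll until a trivial API call succeeds.
until podman --url "$sock" info >/dev/null 2>&1; do
    sleep 0.05
done
end=$(date +%s%N)

echo "service ready after $(( (end - start) / 1000000 )) ms"
kill "$svc"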

@vrothberg
Member Author

I agree. Doesn't look like a deadlock or a sqlite fart.

We need to track down what's different about the Debian images. I recall that Go on Ubuntu was once responsible for a similar slowdown.

@baude
Member

baude commented Sep 7, 2023

I'm going to close this, as I think we have addressed the original complaint in spirit. Specific issues should now be filed so we can fix them.

@baude baude closed this as completed Sep 7, 2023
@edsantiago
Member

Thank you for noticing this. Yes, it can be closed; we now have parity between remote and local e2e tests:

type distro user DB local remote container
int debian-13 root 35:29 36:42
int fedora-37 root 37:37 36:55 38:41
int fedora-38 root 35:23 35:33 34:57
int fedora-38 root sqlite 40:08 36:47
int rawhide root sqlite 35:23 38:04
int debian-13 rootless 34:07
int fedora-37 rootless 35:20
int fedora-38 rootless 33:58
int fedora-38 rootless sqlite 36:18
int rawhide rootless sqlite 37:35
sys debian-13 root 29:39 21:01
sys fedora-37 root 33:42 22:09
sys fedora-38 root 33:25 21:40
sys fedora-38 root sqlite 33:41 22:24
sys fedora-38-aarch64 root 29:22 19:50
sys rawhide root sqlite 30:45 20:44
sys debian-13 rootless 29:49
sys fedora-37 rootless 32:08
sys fedora-38 rootless 32:01 19:38
sys fedora-38 rootless sqlite 33:19
sys rawhide rootless sqlite 32:02
bud fedora-38 root 29:01 29:11

(source: #19887, a recently-closed PR)

@vrothberg
Member Author

Great work everyone!
