slow remote e2e tests #15272

Closed
vrothberg opened this issue Aug 10, 2022 · 30 comments

@vrothberg
Member

The remote e2e tests are noticeably slower than the local ones (see table below), in some cases by up to a factor of two. Speeding up the e2e tests would shorten the CI feedback loop and probably reduce CI costs.

type distro user local remote container
int fedora-36 root 30:01 48:17 29:23
int ubuntu-2204 root 30:27 52:12

For a moment I thought that the remote tests would load the cache via the remote client, but as it turned out in #15266, the remote tests are already loading the local tar files via the local client.
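
To illustrate the distinction (a rough sketch; the tarball path is made up for the example), loading via the local client reads the tarball straight into the image store, while loading via the remote client would first stream it over the API socket:

$ podman load -i /var/tmp/cache/alpine.tar          # local client: reads the tar directly from disk
$ podman-remote load -i /var/tmp/cache/alpine.tar   # remote client: streams it over the API socket first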

Assigning to @baude who raised interest in taking a closer look.

@edsantiago
Member

I noticed a lot of yellow SLOW TEST warnings while poking around on one of my PRs today, and:

$ cirrus-pr-timing --download 15275
...
$ cd cirrus-pr-timing.5394951126122496
$ grep -c 'SLOW TEST' int-*-root-*    (formatted for legibility)
int-podman-fedora-36-root-container  : 128
int-podman-fedora-36-root-host       : 149
int-podman-ubuntu-2204-root-host     : 185
int-remote-fedora-36-root-host       : 509   <---
int-remote-ubuntu-2204-root-host     : 506   <---

Obvious next step is trying to reproduce on my laptop... but no luck:

$ for i in "" "-remote"; do time sudo ./ginkgo $i "pause a checkpointed container by id" >/dev/null;done
sudo ./ginkgo $i "pause a checkpointed container by id" > /dev/null  7.08s user 1.67s system 79% cpu 11.016 total
sudo ./ginkgo $i "pause a checkpointed container by id" > /dev/null  7.01s user 1.67s system 85% cpu 10.173 total

[this is my ginkgo script]. The most important part is its setting of TMPDIR, which populates the cached images on first run. I use time rather than relying on ginkgo's own timing, which could obscure internal bottlenecks.
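
In case it helps anyone reproduce, here is a minimal sketch of what such a wrapper could look like (this is not the actual script; the TMPDIR path and the use of the "remote" build tag are assumptions):

#!/bin/bash
# Hypothetical ./ginkgo wrapper: pin TMPDIR so the image cache populated on
# the first run is reused, then run one e2e test by description, optionally
# against podman-remote.
set -euo pipefail

export TMPDIR=/var/tmp/podman-e2e-tmp    # assumption: any persistent directory works
mkdir -p "$TMPDIR"

tags=""
if [ "${1:-}" = "-remote" ]; then
    tags="remote"                        # assumption: remote e2e builds with the 'remote' tag
    shift
fi

cd test/e2e
# --focus limits the run to It() descriptions matching the argument
exec ginkgo --tags "$tags" --focus "$*" .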

Can't think of anything else to try right now.

@edsantiago
Member

Slightly better luck with `rmi with cached images`, which seems to be the slowest of the slow (111s in CI):

$ for i in "" "-remote"; do time sudo ./ginkgo $i "rmi with cached images" >/dev/null;done
sudo ./ginkgo $i "rmi with cached images" > /dev/null   7.88s user 2.50s system 110% cpu  9.375 total
sudo ./ginkgo $i "rmi with cached images" > /dev/null  13.31s user 3.97s system 119% cpu 14.485 total

Takes almost double the time, but still O(10s) not O(100s).

When I run it manually the test fails (not gonna even look at why), but the slow steps are the podman build ones. That's easy to spot interactively because the unbuffered output stalls.

Anyhow, FWIW.

@vrothberg
Member Author

Thanks for the valuable insights, @edsantiago !

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 12, 2022

Is this moving forward?

@vrothberg
Member Author

@baude, did you find time to look further into this?

@edsantiago
Member

Hi, just checking in. I ran cirrus-pr-timing on #15896, one of my PRs today, a simple one:

type distro user local remote container
int fedora-36 root 31:06 45:41 26:12
int ubuntu-2204 root 30:48 34:11
int fedora-36 rootless 28:52
int ubuntu-2204 rootless 26:15
sys fedora-36 root 30:16 27:47
sys fedora-36-aarch64 root 33:22 22:57
sys ubuntu-2204 root 26:31 15:44
sys fedora-36 rootless 29:51 27:10
sys ubuntu-2204 rootless 26:27
bud fedora-36 root 23:48 24:03

fedora int is still slow, but ubuntu int is now much closer between local and remote. Remote is still slower, but not preposterously so.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member

Here's my periodic cirrus-pr-timing checkin, this time on #16284:

type distro user local remote container
int fedora-36 root 26:54 29:30 25:43
int ubuntu-2204 root 25:37 31:48
int fedora-36 rootless 24:15
int ubuntu-2204 rootless 27:57
sys fedora-36 root 23:08 16:10
sys fedora-36-aarch64 root 26:15 17:45
sys ubuntu-2204 root 19:13 14:22
sys fedora-36 rootless 21:08 13:16
sys ubuntu-2204 rootless 19:35
bud fedora-36 root 24:26 25:58

Dramatic improvement in remote tests. Does anyone have a clue why? Am I going to need to bisect to figure out when things changed?

@vrothberg
Member Author

@edsantiago a lot of performance improvements went in recently. Aren't we spinning up a new system service for each test? If so, I am pretty sure that the shutdown improvements add up.

@edsantiago
Member

Good thinking... but that's not it. cirrus-pr-timing is almost instantaneous, so I decided to just bisect down the git history. And the evidence suggests that the winner is #15826 which brought in new VM images:

type distro user local remote container
int fedora-36 root 27:12 30:29 24:27
int ubuntu-2204 root 27:55 33:36
int fedora-36 rootless 27:26
int ubuntu-2204 rootless 25:23
sys fedora-36 root 23:46 16:18
sys fedora-36-aarch64 root 26:44 18:00
sys ubuntu-2204 root 20:05 14:55
sys fedora-36 rootless 22:04 13:41
sys ubuntu-2204 rootless 20:01
bud fedora-36 root 24:00 24:35

Subsequent PRs with that as merge-base have similar timings. This strongly suggests that the problem was in the VM images, not in podman itself. This is good (yay podman) and bad (we need to be SUPER CAREFUL now when switching VM images) (and, we probably need to spend some time to understand just what the heck the difference is).

@edsantiago
Member

Because there's talk of bumping up VM images, here's a sample timing run from today (PR #17056):

type distro user local remote container
int fedora-36 root 29:15 30:44 28:32
int fedora-37 root 28:28 31:09 26:04
int fedora-36 rootless 25:29
int fedora-37 rootless 25:42
sys fedora-36 root 28:19 18:06
sys fedora-37 root 26:07 18:07
sys fedora-37-aarch64 root 29:01 19:35
sys fedora-36 rootless 24:36
sys fedora-37 rootless 24:48 15:43
bud fedora-37 root 26:48 26:33

@vrothberg
Member Author

I think we can close the issue and open a new one "slow local e2e tests" 🤣

@edsantiago
Member

Because there's no other place to keep track of PR timings, here's cirrus-pr-timing for 17589, a PR run with the latest CI VMs based on 17503 (the Debian one):

type distro user local remote container
int debian-12 root 26:31 34:34
int fedora-36 root 30:58 36:01 29:53
int fedora-37 root 29:39 33:02 28:16
int debian-12 rootless 23:34
int fedora-36 rootless 27:57
int fedora-37 rootless 29:10
sys debian-12 root 24:53 18:17
sys fedora-36 root 29:32 19:48
sys fedora-37 root 28:13 18:47
sys fedora-37-aarch64 root 32:16 21:06
sys debian-12 rootless 25:01
sys fedora-36 rootless 26:49
sys fedora-37 rootless 26:42 16:34
bud fedora-37 root 28:05 28:10

...looks pretty much the same to me, but I just eyeballed, did not script-diff.

@edsantiago
Member

Continuing the tradition of using this issue as a central holding pen for PR timing results, here's a table that now includes SQLite results (pr 17855):

type distro user DB local remote container
int debian-12 root 27:02 34:58
int fedora-36 root 33:06 37:01 32:47
int fedora-37 root 29:00 35:01 27:22
int fedora-37 root sqlite 28:42 34:06
int debian-12 rootless 26:08
int fedora-36 rootless 29:15
int fedora-37 rootless 27:52
int fedora-37 rootless sqlite 28:04
sys debian-12 root 25:35 17:40
sys fedora-36 root 30:23 20:44
sys fedora-37 root 28:48 19:12
sys fedora-37-aarch64 root 32:56 21:57
sys debian-12 rootless 27:55
sys fedora-36 rootless 29:25
sys fedora-37 rootless 27:57 18:37
bud fedora-37 root 30:33 27:29

@vrothberg
Member Author

I am personally happy with the current state of affairs. When I created the issue, the delta was really big, but it has improved significantly since then.

@edsantiago, out of curiosity. Do we have average/median time to run 1 test? Some tests are skipped for remote and some for local, so I am curious what the average times may look like.

@edsantiago
Member

Do we have average/median time to run 1 test?

I'm not sure I understand your question. Like, do I gather statistics on all PRs and aggregate them? No, sorry. On my fantasy wishlist but much too tricky to spend serious time on.

Your "1 test" confuses me though, it seems to suggest something like averaging between root/rootless/remote and different distros?

@vrothberg
Member Author

What I meant is time per test: $numberOfExecutedTests divided by $totalTime could be a starting point.

@vrothberg
Member Author

Eeeh, the other way around, of course: time/tests.
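
As a rough starting point (assuming the downloaded Cirrus logs still contain Ginkgo's "Ran N of M Specs in X seconds" summary line), something like this could compute it:

$ awk '/Ran [0-9]+ of [0-9]+ Specs in/ {
        ran  += $2          # specs actually executed (skipped ones excluded)
        secs += $(NF-1)     # wall-clock seconds reported by ginkgo
      }
      END { if (ran) printf "%d specs, %.0fs total, %.1fs per spec\n", ran, secs, secs/ran }' int-*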

@edsantiago
Member

I still don't understand, possibly because "test" is such an overloaded term. Do you mean "number of individual It() (e2e) or @test (bats) tests in one specific Cirrus CI task", such as, "it took 30:00 (1800s) to run 250 tests, therefore, each test ran in a mean time of 7.2 seconds"? That's kind of hard to calculate (requires downloading full logs for each test), and I'm not sure it's meaningful given the huge variance in individual tests.

If that's not what you mean, could you give me an example? Sorry to be so dense today.

@vrothberg
Member Author

That's kind of hard to calculate (requires downloading full logs for each test), and I'm not sure it's meaningful given the huge variance in individual tests.

Yes, that's what I am interested in. But I see this as nice to have and do not think at all it's critical.

@edsantiago
Member

We've got a problem on Debian. Remote e2e tests are consistently taking 49m to run (compared to 39-42m for all Fedoras), which means that every so often they exceed 50m and get killed by the Cirrus timeout. I will not have time to investigate this until next week.

  • debian-12 : int remote debian-12 root host sqlite [remote]
    • 06-15 09:08 in int remote debian-12 root host sqlite: Timeout
    • 06-14 18:59 in int remote debian-12 root host sqlite: Timeout
    • 06-14 11:38 in int remote debian-12 root host sqlite: Timeout
    • 05-27 16:58 in int remote debian-12 root host sqlite: Timeout
    • 05-26 10:51 in int remote debian-12 root host sqlite: Timeout

@vrothberg
Member Author

Thanks for sharing, @edsantiago.

Smells like a deadlock.

Cc: @mheon

@mheon
Member

mheon commented Jun 16, 2023

I'm looking for examples in PRs of int remote debian-12 root host sqlite where it hasn't failed (but has almost done so), but I can't actually find it running on any of our PRs. We have an int remote debian-12 root host boltdb task but seemingly no sqlite version?

@edsantiago
Member

The sqlite is there because all of those runs come from #17831, my hammer-away PR; debian+sqlite is not a real-world case. Sorry for the confusion. Here's a 46-minute run, from my last test run yesterday.

This has nothing to do with sqlite, though: debian-remote is in the 46-to-49-minute range (super slow) on all CI runs.

@mheon
Member

mheon commented Jun 16, 2023

Looking at the test timings in the linked run - there are a few long tests, but not an unreasonable number, and even the longest is only a bit over a minute. It feels like every test being run is just slow: we seem to have hundreds of tests taking 2-3 seconds each, for example. So this is a systemic issue, not individual tests running really poorly on Debian.

Given this, I really doubt we're looking at deadlocks. I wonder if it's something to do with podman system service - maybe it's particularly slow to start on Debian and adds a lot of overhead?
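
One quick way to test that theory (just a sketch; the socket path is arbitrary and the polling loop is only illustrative) would be to time how long the service needs before it answers an API call:

#!/bin/bash
# Rough sketch: measure how long "podman system service" takes before the
# API socket actually answers, which approximates the per-test startup
# overhead if every e2e test spins up its own service.
sock=unix:///tmp/podman-timing.sock     # arbitrary path for this experiment

start=$(date +%s%N)
podman system service --time=0 "$sock" &
svc=$!

# Poll until a trivial API call succeeds.
until podman --url "$sock" info >/dev/null 2>&1; do
    sleep 0.05
done
end=$(date +%s%N)

echo "service ready after $(( (end - start) / 1000000 )) ms"
kill "$svc"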

@vrothberg
Member Author

I agree. Doesn't look like a deadlock or a sqlite fart.

We need to track down what's different about the Debian images. I recall that Go on Ubuntu was once responsible for a similar slowdown.

@baude
Member

baude commented Sep 7, 2023

I'm going to close this, as I think we have addressed the original complaint in spirit. Specific issues should now be filed so we can fix them.

@baude baude closed this as completed Sep 7, 2023
@edsantiago
Member

Thank you for noticing this. Yes, it can be closed; we now have parity between remote and local e2e tests:

type distro user DB local remote container
int debian-13 root 35:29 36:42
int fedora-37 root 37:37 36:55 38:41
int fedora-38 root 35:23 35:33 34:57
int fedora-38 root sqlite 40:08 36:47
int rawhide root sqlite 35:23 38:04
int debian-13 rootless 34:07
int fedora-37 rootless 35:20
int fedora-38 rootless 33:58
int fedora-38 rootless sqlite 36:18
int rawhide rootless sqlite 37:35
sys debian-13 root 29:39 21:01
sys fedora-37 root 33:42 22:09
sys fedora-38 root 33:25 21:40
sys fedora-38 root sqlite 33:41 22:24
sys fedora-38-aarch64 root 29:22 19:50
sys rawhide root sqlite 30:45 20:44
sys debian-13 rootless 29:49
sys fedora-37 rootless 32:08
sys fedora-38 rootless 32:01 19:38
sys fedora-38 rootless sqlite 33:19
sys rawhide rootless sqlite 32:02
bud fedora-38 root 29:01 29:11

(source: #19887, a recently-closed PR)

@vrothberg
Member Author

Great work everyone!
