Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

freezer: add delay after freeze #2941

Merged

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented May 6, 2021

I hate to keep adding those kludges (for earlier ones, see #2918, #2791, #2774)
but lately TestFreeze (and TestSystemdFreeze) from libcontainer/integration
fails a lot (see #2907).

The failure comes and goes, and is probably this is caused by a slow host
allocated for the test, and a slow VM on top of it.

To remediate, add a small sleep on every 25th iteration in between
asking the kernel to freeze and checking its status.

In the worst case scenario (failure to freeze), this adds about 0.4 ms
(40 x 10 us) to the duration of the call.

It is hard to measure how this affects CI as GHA plays a roulette when
allocating a node to run the test on, but it seems to help. With
additional debug info, I saw somewhat frequent "frozen after 24 retries"
or "frozen after 49 retries", meaning it succeeded right after the added
sleep.

While at it, rewrite/improve the comments.

Fixes: #2907.

@@ -65,6 +65,10 @@ func (s *FreezerGroup) Set(path string, r *configs.Resources) (Err error) {
return err
}

if i%25 == 24 {
// A short sleep before reading back also helps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the comment to clarify this helps what

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a big comment above telling the whole story...

Ah, I just found it now contradicts what I say here. Will fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments rewritten; PTAL @AkihiroSuda

@kolyshkin
Copy link
Contributor Author

This is from a "good" VM (once you start testing, gha gives you a good VM...).

Out of 400 runs, only 147 needed more than 1 retry.

Out of those that need a retry, there are peaks at 24 and 49, meaning sleep helps.

[kir@kir-rhat runc]$ grep frozen xx | awk '{print $5}' | sort -n | uniq -c
      8 2
      7 3
      4 4
      2 5
      1 6
      2 7
      2 8
      1 10
      1 13
      1 14
      1 19
      4 22
      1 23
     85 24
      1 25
      1 29
      1 30
      1 31
      1 36
      1 39
      1 41
      1 45
     15 49
      1 50
      1 52
      1 54
      1 123

@dqminh
Copy link
Contributor

dqminh commented May 6, 2021

LGTM

I hate to keep adding those kludges, but lately TestFreeze (and
TestSystemdFreeze) from libcontainer/integration fails a lot. The
failure comes and goes, and is probably this is caused by a slow host
allocated for the test, and a slow VM on top of it.

To remediate, add a small sleep on every 25th iteration in between
asking the kernel to freeze and checking its status.

In the worst case scenario (failure to freeze), this adds about 0.4 ms
(40 x 10 us) to the duration of the call.

It is hard to measure how this affects CI as GHA plays a roulette when
allocating a node to run the test on, but it seems to help. With
additional debug info, I saw somewhat frequent "frozen after 24 retries"
or "frozen after 49 retries", meaning it succeeded right after the added
sleep.

While at it, rewrite/improve the comments.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin force-pushed the another-kludge-to-fix-freeze branch from b787703 to 524abc5 Compare May 6, 2021 17:52
@kolyshkin kolyshkin added this to the 1.0.0-rc94 milestone May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[centos7] Test{,Systemd}Freeze FAIL: unexpected error: unable to freeze
4 participants