Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky TestInspect #42251

Merged
merged 1 commit into from
Apr 5, 2021
Merged

Fix flaky TestInspect #42251

merged 1 commit into from
Apr 5, 2021

Conversation

thaJeztah
Copy link
Member

(hopefully) fixes #40651

This test has been flaky for a long time, failing with:

--- FAIL: TestInspect (12.04s)
    inspect_test.go:39: timeout hit after 10s: waiting for tasks to enter run state. task failed with error: task: non-zero exit (1)

While looking through logs, noticed tasks were started, entering RUNNING stage,
and then exited, to be started again.

state.transition="STARTING->RUNNING"
...
msg="fatal task error" error="task: non-zero exit (1)"
...
state.transition="RUNNING->FAILED"

Looking for possible reasons, first considering network issues (possibly we ran
out of IP addresses or networking not cleaned up), then I spotted the issue.

The service is started with;

Command:         []string{"/bin/top"},
Args:            []string{"-u", "root"},

The -u root is not an argument for the service, but for /bin/top. While the
Ubuntu/Debian/GNU version top has a -u/-U option;

docker run --rm ubuntu:20.04 top -h 2>&1 | grep '\-u'
  top -hv | -bcEHiOSs1 -d secs -n max -u|U user -p pid(s) -o field -w [cols]

The busybox version of top does not:

docker run --rm busybox top --help 2>&1 | grep '\-u'

So running top -u root would cause the task to fail;

docker run --rm busybox top -u root
top: invalid option -- u
...

echo $?
1

As a result, the service went into a crash-loop, and because the poll.WaitOn()
was running with a short interval, in many cases would just find the RUNNING
state, perform the service inspect, and pass, but in other cases, it would not
be that lucky, and continue polling untill we reached the 10 seconds timeout,
and mark the test as failed.

Looking for history of this option (was it previously using a different image?) I
found this was added in 6cd6d86 (#33684), but probably
just missed during review.

Given that the option is only set to have "something" to inspect, I replaced
the -u root with -d 5, which makes top refresh with a 5 second interval.

Note that there is another test (`TestServiceListWithStatuses) that uses the same
spec, however, that test is skipped based on API version of the test-daemon, and
(to be looked into), when performing that check, no API version is known, causing
the test to (always?) be skipped:

=== RUN   TestServiceListWithStatuses
    --- SKIP: TestServiceListWithStatuses (0.00s)
        list_test.go:34: versions.LessThan(testEnv.DaemonInfo.ServerVersion, "1.41")

- A picture of a cute animal (not mandatory but encouraged)

acting-like-animals-owwwww-ice-cream-headache

This test has been flaky for a long time, failing with:

    --- FAIL: TestInspect (12.04s)
        inspect_test.go:39: timeout hit after 10s: waiting for tasks to enter run state. task failed with error: task: non-zero exit (1)

While looking through logs, noticed tasks were started, entering RUNNING stage,
and then exited, to be started again.

    state.transition="STARTING->RUNNING"
    ...
    msg="fatal task error" error="task: non-zero exit (1)"
    ...
    state.transition="RUNNING->FAILED"

Looking for possible reasons, first considering network issues (possibly we ran
out of IP addresses or networking not cleaned up), then I spotted the issue.

The service is started with;

    Command:         []string{"/bin/top"},
    Args:            []string{"-u", "root"},

The `-u root` is not an argument for the service, but for `/bin/top`. While the
Ubuntu/Debian/GNU version `top` has a -u/-U option;

    docker run --rm ubuntu:20.04 top -h 2>&1 | grep '\-u'
      top -hv | -bcEHiOSs1 -d secs -n max -u|U user -p pid(s) -o field -w [cols]

The *busybox* version of top does not:

    docker run --rm busybox top --help 2>&1 | grep '\-u'

So running `top -u root` would cause the task to fail;

    docker run --rm busybox top -u root
    top: invalid option -- u
    ...

    echo $?
    1

As a result, the service went into a crash-loop, and because the `poll.WaitOn()`
was running with a short interval, in many cases would _just_ find the RUNNING
state, perform the `service inspect`, and pass, but in other cases, it would not
be that lucky, and continue polling untill we reached the 10 seconds timeout,
and mark the test as failed.

Looking for history of this option (was it previously using a different image?) I
found this was added in 6cd6d86, but probably
just missed during review.

Given that the option is only set to have "something" to inspect, I replaced
the `-u root` with `-d 5`, which makes top refresh with a 5 second interval.

Note that there is another test (`TestServiceListWithStatuses) that uses the same
spec, however, that test is skipped based on API version of the test-daemon, and
(to be looked into), when performing that check, no API version is known, causing
the test to (always?) be skipped:

    === RUN   TestServiceListWithStatuses
        --- SKIP: TestServiceListWithStatuses (0.00s)
            list_test.go:34: versions.LessThan(testEnv.DaemonInfo.ServerVersion, "1.41")

Signed-off-by: Sebastiaan van Stijn <[email protected]>
@tiborvass
Copy link
Contributor

@thaJeztah our hero.

TBH not sure if it has to be top, could be sleep as far as i'm concerned

Copy link
Member

@tianon tianon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test: TestInspect
5 participants