-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deflake integration tests #1599
Conversation
f4f6c39
to
4c566aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvements @ldez 👍 Left a few comments.
Also:
- We could go even farther on this, mimicing a bit what I started to do on docker integration request package — i.e. instead of requiring the test to create the
Request
, etc.. just callRequest(…)
with the correct arguments. - We could try to make
Try
more composable (so you could doTry(…, Request(…), conditions..)
), etc.. - Finaly, as it's a testing package, we could even make the
Try
function take at *testing.T
as first argument, and callingt.Fatal
in case of error 👼
integration/access_log_test.go
Outdated
@@ -12,6 +12,7 @@ import ( | |||
"strings" | |||
"time" | |||
|
|||
"github.com/containous/traefik/integration/utils" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this isn't related to this PR, but integration/try
would make more sense (and removing Try
from function names)
integration/utils/try.go
Outdated
return nil | ||
} | ||
|
||
stopTime := time.Now().Add(timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stopTimer := time.NewTimer(timeout)
retryTick := time.NewTicker(interval)
for {
select {
case <-stopTimer.C:
err = fmt.Errorf("try operation failed: %s", err)
break
case <-retryTick.C:
if err = operation(); err == nil {
// do your thing
break
}
}
}
// This also can be done inside the loop, in the retryTick.C case
stopTimer.Stop()
retryTick.Stop()
return nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code doesn't work -> infinite timeout when the operation failed.
for {
select {
case <-time.After(timeout):
return fmt.Errorf("try operation failed: %s", err)
case <-time.Tick(wait):
if err = operation(); err == nil {
return nil
}
}
}
-> infinite timeout when the operation failed.
stopTimer := time.NewTimer(timeout)
retryTick := time.NewTicker(interval)
for {
select {
case <-stopTimer.C:
err = fmt.Errorf("try operation failed: %s", err)
stopTimer.Stop()
retryTick.Stop()
return err
case <-retryTick.C:
if err = operation(); err == nil {
stopTimer.Stop()
retryTick.Stop()
return err
}
}
}
-> works but not very beautiful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code doesn't work -> infinite timeout when the operation failed.
Why would it ? at some point (timeout
) stopTimer.C
will get the message and we get out of the infinit loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know exactly why but it produce an infinite loop.
it produce 2 loops: one stop at the timeout and one infinite.
integration/utils/try.go
Outdated
interval = maxInterval | ||
} | ||
|
||
ci := os.Getenv("CI") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not a huge fan of that. We shouldn't have a special case for CI. If we need a 3x
multiplier for timeout in CI, let's just set all the timeout to 3x
their value..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, this doesn't work because we don't access to global env var in the integration test.
But why I have add this special case?
I don't want to wait 3 min in my local environment to known if a test fail, but in the CI some tests are more slow than a personal computer, we need to have a pragmatic approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, this doesn't work because we don't access to global env var in the integration test.
hum.. 🤔 we have access to whichever the environment variable used to execute the tests, so we just need to pass them.
But why I have add this special case?
I don't want to wait 3 min in my local environment to known if a test fail, but in the CI some tests are more slow than a personal computer, we need to have a pragmatic approach.
hum, I don't really buy that but meh I won't fight again, it's just timeout stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, this doesn't work because we don't access to global env var in the integration test.
hum.. 🤔 we have access to whichever the environment variable used to execute the tests, so we just need to pass them.
in our repo via .semaphoreci/vars we can pass global vars to the CI, but in this case the issue is the overlay docker level (which adds a filter).
To be visible during the tests, The CI global vars need to be passed in the make file like that: atbore-phx@6aef257
CI log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not a huge fan of that. We shouldn't have a special case for CI. If we need a 3x multiplier for timeout in CI, let's just set all the timeout to 3x their value..
I'm agree "in theory" with this sentence if we are speaking about "timeout" stuff,
but i feel, I'm missing something, just to understand, why do we wait 3 min more, locally ? the value is a timeout so a max "break" value. isn't it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already experienced cases where I had run into integration test timeouts locally only when my laptop was running under high CPU load. Other machines with less or more resources could behave totally differently.
This is really a classical case of YMMV. I'm with @vdemeester that we should just swallow the pill and increase the timeouts globally. For the vast majority of cases, we won't feel the slowdown because the tests will complete quickly enough. For those cases where we experience lots of long-running failing tests, you'd probably start them individually anyway because nobody is going to wait 60, 120, or 180 seconds for a whole bunch of failing integration tests to complete.
Finally, the purpose of this PR is to reduce flakiness, and not waiting long enough is a source of flakiness. I'd rather eliminate this risk and pay the price that things will be a bit slower than necessary if things go south temporarily.
integration/utils/try.go
Outdated
|
||
// Sleep pauses the current goroutine for at least the duration d. | ||
// Use only when use an other Try[...] functions is not possible. | ||
func Sleep(d time.Duration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't have a special case for CI, this doesn't make sense. We should be able to replace any time.Sleep
from the tests code with a Try
or sthg that is better than a plain old Sleep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, in some case we really need to use time.Sleep
, the Try
logical cannot solve all case.
We need a second way with a stuff like: "Retry again for X time"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loads of cases should/can be handled with Try
or something that acting almost the same.
If we Sleep
:
- when waiting for the process to run, we should use
Try
with a condition (i.e. something that checks the process is running, or some port is open, or for some output of the command to appear, or ?) - when waiting for a k/v value to be there, same, we should put the get into a condition
The less we use Sleep
, the better really 👼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/containous/traefik/blob/master/integration/marathon_test.go#L39
This case need a continue
approach, try
doesn't work in this case.
PS: this test file seems to test nothing, because all request seems always give the same response, look at the comment block (I have failed to make this work).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first case (https://github.com/containous/traefik/blob/master/integration/marathon_test.go#L39) can handled with a Try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also look at https://github.com/containous/traefik/blob/master/integration/docker_test.go#L180
Yes, this one could/should use TryRequest
(with the condition on getting a 404
or sthg)
integration/utils/try.go
Outdated
// Conditions are not allowed since it would complicate signaling if the | ||
// response body needs to be closed or not. Callers are expected to close on | ||
// their own if the function returns a nil error. | ||
func TryGetResponseUntilStatusCode(url string, timeout time.Duration, statusCode int) (*http.Response, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is not used elsewhere
integration/utils/try.go
Outdated
} | ||
return nil | ||
} | ||
} | ||
|
||
func doGetResponse(url string, timeout time.Duration, condition Condition) (*http.Response, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is not required, let's just pass the http method
to doResponse
and use only doRespose
integration/consul_catalog_test.go
Outdated
|
||
_, err = ioutil.ReadAll(resp.Body) | ||
cond := utils.ComposeCondition(utils.UntilStatusCodeIs(200), utils.BodyContains("Welcome to nginx!")) | ||
err = utils.TryRequest(req, 5*time.Second, cond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have make TryRequest
taking variadic args, so that you could do TryRequest(req, 5*time.Second, utils.UntilStatusCodeIs…, utils.Body…)
.
integration/healthcheck_test.go
Outdated
|
||
resp, err = client.Do(req) | ||
utils.Sleep(3 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about that one, at all. What are we testing here : that the health was 500
before (it's tested above) and that in a given period of time, it will be 200
. I don't see why we re-verify 500
, wait for 3s and then wait for 200
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to wait the time to do the check of the "healthcheck"
integration/healthcheck_test.go
Outdated
c.Assert(err, checker.IsNil) | ||
} | ||
|
||
utils.Sleep(3 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost the same here, if we want to wait 3s and then have the TryRequest
, let's just set a bigger timeout for the Try
.
when you think this PR reaches a steady state, just ping me and i could execute a series on the CI to verify if the Error Probability is decreasing like that: #1600. |
It's the first step, I will create the testing package after. |
f5cbc19
to
c6c98da
Compare
integration/access_log_test.go
Outdated
@@ -71,14 +79,14 @@ func (s *AccessLogSuite) TestAccessLog(c *check.C) { | |||
tokens, err := shellwords.Parse(line) | |||
c.Assert(err, checker.IsNil) | |||
c.Assert(len(tokens), checker.Equals, 13) | |||
c.Assert(tokens[6], checker.Equals, "200") | |||
c.Assert(regexp.MustCompile("^\\d\\d\\d$").MatchString(tokens[6]), checker.True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please use backticks instead of quotes to avoid extra escaping.
- We should be able to specify directly how many digits we expect, i.e.,
\d{3}
.
integration/acme_test.go
Outdated
@@ -9,34 +9,24 @@ import ( | |||
|
|||
"github.com/go-check/check" | |||
|
|||
"errors" | |||
"github.com/containous/traefik/integration/utils" | |||
"github.com/containous/traefik/integration/try" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should go above line 10 to sort alphabetically.
@@ -65,12 +55,9 @@ func (s *AcmeSuite) TestRetrieveAcmeCertificate(c *check.C) { | |||
client := &http.Client{Transport: tr} | |||
|
|||
// wait for traefik (generating acme account take some seconds) | |||
err = utils.Try(30*time.Second, func() error { | |||
err = try.Do(90*time.Second, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use Nevermind, sorry.try.GetRequest
?
integration/consul_catalog_test.go
Outdated
leader, err := consulClient.Status().Leader() | ||
|
||
if err != nil || len(leader) == 0 { | ||
return fmt.Errorf("Leader not find. %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/find/found/
integration/consul_catalog_test.go
Outdated
|
||
_, err = ioutil.ReadAll(resp.Body) | ||
err = try.Request(req, 5*time.Second, try.StatusCodeIs(200), try.BodyContains("Welcome to nginx!")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the extra BodyContains
check? I'm worried it'll break if we change the Nginx image.
200 should be good enough, shouldn't it?
integration/healthcheck_test.go
Outdated
c.Assert(err, checker.IsNil) | ||
|
||
time.Sleep(time.Second * 3) | ||
// Verify frontend health : before | ||
err = try.Request(req, 3*time.Second, try.StatusCodeIs(500)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this.
integration/healthcheck_test.go
Outdated
|
||
resp, err = client.Do(req) | ||
try.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want another query check at this point rather making sure whoami1Host
's health check returns 200 again.
integration/healthcheck_test.go
Outdated
try.Sleep(2 * time.Second) | ||
|
||
// Verify frontend health : 500 | ||
err = try.Request(req, 3*time.Second, try.StatusCodeIs(500)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...this should not be needed anymore.
integration/healthcheck_test.go
Outdated
resp, err = client.Do(req) | ||
try.Sleep(2 * time.Second) | ||
|
||
// Verify frontend health : after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the block starting from here to line 91 (incl.) isn't needed as we just want the single, static request starting at line 93.
integration/https_test.go
Outdated
@@ -25,7 +26,7 @@ func (s *HTTPSSuite) TestWithSNIConfigHandshake(c *check.C) { | |||
c.Assert(err, checker.IsNil) | |||
defer cmd.Process.Kill() | |||
|
|||
time.Sleep(500 * time.Millisecond) | |||
try.Sleep(500 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS, this should be the only try.Sleep
call left after my comments regarding the other ones are addressed. What is this one waiting for? Isn't this just another "wait for Traefik to start" check that we tend to solve via a retriable request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is already a tremendously good PR @ldez. 👏
I think there are still 3 open points:
- Health check test: I think the implementation is still susceptible to flakiness (primarily because of the
try.Sleep
usage) and that there's room for simplification. Please see my latest comments. try.Sleep
: I'm quite confident we don't need them and can actually do better. I have made suggestions how to replace all of them, except for one where I'm not totally sure but it looks replaceable too.- CI env var: As outlined, I'm not really in favor of it. It's the least important point to me, though.
} | ||
|
||
// Waiting for Traefik healthcheck | ||
try.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to drop this one if we just increase the timeout in line 65 a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted all sleep except this one and refactor the whole test.
integration/healthcheck_test.go
Outdated
time.Sleep(time.Second * 3) | ||
|
||
resp, err = client.Do(req) | ||
// Verify frontend health : before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the remaining block up until line 89 can be simplified as following:
- Try-request for a 200 response code. Once we succeed, we know for sure that the first whoami host has successfully toggled its health status back to 200. Also, this way we don't have to rely on the
try.Sleep
in line 79. - Try-request for a 500 response code. This verifies that the second whoami host is still unhealthy.
Paraphrased, we can remove the section between line 74 and line 80.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this PR is so amazing @ldez
LGTM 👏 👏 👏
integration/acme_test.go
Outdated
return nil | ||
}) | ||
|
||
err := try.GetRequest("http://"+s.boulderIP+":4000/directory", 120*time.Second, try.StatusCodeIs(200)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sed/200/http.StatusOK eventually ?
|
||
return nil | ||
}) | ||
c.Assert(err, checker.IsNil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Note that this is the reason why I think try.Do(t, 3*time.Second, …)
would fit perfectly. But can be done in follow-ups.
@@ -66,7 +63,7 @@ func (s *EurekaSuite) TestSimpleConfiguration(c *check.C) { | |||
} | |||
}` | |||
|
|||
tmpl, err := template.New("eurekaTemlate").Parse(eurekaTemplate) | |||
tmpl, err := template.New("eurekaTemplate").Parse(eurekaTemplate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😱
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few small comments, but can definitely merged as is 😎 🎉 👏
LGTM 🦁 (probably not with 54 commits though 😝)
|
||
return nil | ||
}) | ||
c.Assert(err, checker.IsNil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Note that this is the reason why I think try.Do(t, 3*time.Second, …)
would fit perfectly. But can be done in follow-ups.
@@ -66,7 +63,7 @@ func (s *EurekaSuite) TestSimpleConfiguration(c *check.C) { | |||
} | |||
}` | |||
|
|||
tmpl, err := template.New("eurekaTemlate").Parse(eurekaTemplate) | |||
tmpl, err := template.New("eurekaTemplate").Parse(eurekaTemplate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😱
// ResponseCondition is a retry condition function. | ||
// It receives a response, and returns an error | ||
// if the response failed the condition. | ||
type ResponseCondition func(*http.Response) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could be unexported (but lint
or vet
might be not happy, not sure)
|
||
// DoCondition is a retry condition function. | ||
// It returns an error | ||
type DoCondition func() error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit (same as above): could be unexported (but lint
or vet
might be not happy, not sure)
|
||
timeout = applyCIMultiplier(timeout) | ||
|
||
var err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I don't think we need to declare it before hand.
bd3a0f9
to
2004de6
Compare
- feat: add CI multiplier - refactor: readability - feat: custom Sleep function - refactor(integration): use custom Sleep - feat: show Try progress - feat(try): try response with status code - refactor(try): use a dedicate package. - refactor(integration): Try everywhere - feat(CI): pass CI env var to Integration Tests. - refactor(acme): increase timeout. - feat(acme): show Traefik logs - refactor(integration): use `http.StatusXXX` - refactor: remove Sleep
2004de6
to
59a48fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description
Base on #1284