Add retries to flaky web integration tests. #2066
If I understand the Slack conversations that precipitated this PR correctly, the flakiness we're observing is the result of a timing issue where these tests are running before grapl-web-ui is ready to serve requests.
If that's the case, I'd recommend fixing the flakiness temporarily by adding a big healthy sleep to the start of each test function, and prioritizing a longer-term project to make sure grapl-web-ui doesn't report healthy until it's able to talk to its backend services, that is, until all its dependencies themselves report healthy.
Once grapl-web-ui has robust healthchecks, I'd recommend that each test function block on grapl-web-ui reporting healthy and only then run the test. Until then, a simple sleep should do the job.
Unless I'm totally misunderstanding the reason for this PR, which is entirely possible...
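For illustration only, the "block on healthy, then run the test" idea could be sketched roughly as below. This is not code from the PR: `wait_until_healthy` and the `is_healthy` probe are hypothetical names, and the probe stands in for whatever healthcheck request grapl-web-ui would eventually expose.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls the (hypothetical) `is_healthy` probe until it returns true
/// or `timeout` has elapsed. Returns true if the service became
/// healthy in time, false otherwise.
fn wait_until_healthy<F>(mut is_healthy: F, timeout: Duration, poll_interval: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if is_healthy() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Wait before probing again so we don't hammer the service.
        sleep(poll_interval);
    }
}
```

A test function would call this first and fail fast if it returns false, rather than sleeping for a fixed duration.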
df04c8a to cb0af7d
Codecov Report
Base: 39.39% // Head: 39.38% // Decreases project coverage by 0.01%
Additional details and impacted files
@@            Coverage Diff            @@
##             main   #2066     +/-   ##
=========================================
- Coverage   39.39%  39.38%  -0.01%
=========================================
  Files         436     436
  Lines       10155   10155
=========================================
- Hits         4001    4000      -1
- Misses       6154    6155      +1
That's not it. A number of the failures we've seen are in a call to [...]. Notes and logs can be found here: https://github.com/grapl-security/issue-tracker/issues/1008
We already have a sleep for just this reason: https://github.com/grapl-security/grapl/blob/main/src/rust/Dockerfile#L263-L292. Because a number of the failures we're seeing happen after a test has already made successful requests, the initial timeout being too small doesn't appear to be the cause.
1ce4955 to 86b7e9c
3e37bff to c4c411b
Apparently, the next version of Consul (1.14) will add support for setting up retry policies on Envoy, per hashicorp/consul#12890. Once we're on that version, I'd like to set that up and potentially back out of this change.
c4c411b to 9fc57d1
9fc57d1 to f708d15
Which issue does this PR correspond to?
We're seeing intermittent 500 errors returned from the Consul sidecar in the grapl-web-ui integration tests.
These often occur after a test has already successfully made requests to grapl-web-ui, suggesting that service has had enough time to come up as healthy.
https://github.com/grapl-security/issue-tracker/issues/1008
What changes does this PR make to Grapl? Why?
This adds retry logic to the grapl-web-ui integration tests. When a request returns a 500 error, the request is retried up to ten times, waiting one second between attempts.
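The retry behavior described above can be sketched roughly as follows. This is an illustrative approximation, not the PR's actual code: `Response` and `send_with_retries` are hypothetical names, and the `send` closure stands in for whatever HTTP client call the tests really make.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the HTTP client's response type.
#[derive(Debug, Clone, PartialEq)]
struct Response {
    status: u16,
}

/// Calls `send` once, then retries while the response is a 500,
/// up to `max_retries` additional attempts, sleeping `delay`
/// before each retry. Returns the last response received.
fn send_with_retries<F>(mut send: F, max_retries: u32, delay: Duration) -> Response
where
    F: FnMut() -> Response,
{
    let mut response = send();
    let mut retries = 0;
    while response.status == 500 && retries < max_retries {
        sleep(delay);
        retries += 1;
        response = send();
    }
    response
}
```

With the values from the description, a test would call it as `send_with_retries(request, 10, Duration::from_secs(1))`. Only 500 responses trigger a retry here, so other failures still surface immediately.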
How were these changes tested?
CI