Skip to content
This repository has been archived by the owner on Dec 26, 2022. It is now read-only.

Add retries to flaky web integration tests. #2066

Merged
merged 1 commit into from
Oct 19, 2022

Conversation

inickles-grapl
Copy link
Contributor

@inickles-grapl inickles-grapl commented Oct 19, 2022

Which issue does this PR correspond to?

We're seeing intermittent 500 errors returned from the Consul sidecar in the grapl-web-ui integration tests.

These often occur after a test has already successfully made requests to grapl-web-ui, suggesting that service has had enough time to come up as healthy.

https://github.com/grapl-security/issue-tracker/issues/1008

What changes does this PR make to Grapl? Why?

This adds retry logic to the grapl-web-ui integration tests. Wen a request returns a 500 error, the request will retry up to ten times, waiting a second between each retry attempt.

How were these changes tested?

CI

Copy link
Contributor

@jgrillo-grapl jgrillo-grapl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the slack conversations that precipitated this PR correctly, the flakiness we're observing is the result of a timing issue where these tests are running before grapl-web-ui is ready to serve requests.

If that's the case, I'd recommend fixing the flakiness temporarily by adding a big healthy sleep to the start of each test function, and prioritizing a longer term project to make sure grapl-web-ui doesn't report healthy until it's able to talk to its backend services--that is, until all its dependencies themselves report healthy.

Once grapl-web-ui has robust healthchecks, I'd recommend in each test function to block on grapl-web-ui reporting healthy, then run the test. Until such a time, a simple sleep should do the job.

Unless I'm totally misunderstanding the reason for this PR, which is entirely possible...

@inickles-grapl inickles-grapl force-pushed the inickles/neutralize-flaky-web-tests branch from df04c8a to cb0af7d Compare October 19, 2022 01:42
@codecov
Copy link

codecov bot commented Oct 19, 2022

Codecov Report

Base: 39.39% // Head: 39.38% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (f708d15) compared to base (7228b83).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2066      +/-   ##
==========================================
- Coverage   39.39%   39.38%   -0.01%     
==========================================
  Files         436      436              
  Lines       10155    10155              
==========================================
- Hits         4001     4000       -1     
- Misses       6154     6155       +1     
Impacted Files Coverage Δ
src/rust/sysmon-parser/src/util.rs 40.93% <0.00%> (-0.68%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@inickles-grapl
Copy link
Contributor Author

If I understand the slack conversations that precipitated this PR correctly, the flakiness we're observing is the result of a timing issue where these tests are running before grapl-web-ui is ready to serve requests.

That's not it, a number of the failures we've seen are in a call to api/plugin/get_metadata, which uses data in the response from a previous and successful call to api/plugin/create.

Notes and logs can be found here: https://github.com/grapl-security/issue-tracker/issues/1008

If that's the case, I'd recommend fixing the flakiness temporarily by adding a big healthy sleep to the start of each test function, and prioritizing a longer term project to make sure grapl-web-ui doesn't report healthy until it's able to talk to its backend services--that is, until all its dependencies themselves report healthy.

Once grapl-web-ui has robust healthchecks, I'd recommend in each test function to block on grapl-web-ui reporting healthy, then run the test. Until such a time, a simple sleep should do the job.

Unless I'm totally misunderstanding the reason for this PR, which is entirely possible...

We already have a sleep for just this reason: https://github.com/grapl-security/grapl/blob/main/src/rust/Dockerfile#L263-L292.

Because a number of the failures we're seeing happen after a test has made successful requests already it suggests it's not due to this initial timeout being too small.

@inickles-grapl inickles-grapl force-pushed the inickles/neutralize-flaky-web-tests branch 2 times, most recently from 1ce4955 to 86b7e9c Compare October 19, 2022 15:14
@inickles-grapl inickles-grapl changed the title Neutralize flaky web integration tests. Add retries to flaky web integration tests. Oct 19, 2022
@inickles-grapl inickles-grapl force-pushed the inickles/neutralize-flaky-web-tests branch 3 times, most recently from 3e37bff to c4c411b Compare October 19, 2022 19:37
@ghost
Copy link

ghost commented Oct 19, 2022

Apparently, the next version of Consul (1.14) will add support for setting up retry policies on Envoy per hashicorp/consul#12890. Once we're on that version, I'd like to set that up and potentially back out of this

@inickles-grapl inickles-grapl force-pushed the inickles/neutralize-flaky-web-tests branch from c4c411b to 9fc57d1 Compare October 19, 2022 21:47
@inickles-grapl inickles-grapl force-pushed the inickles/neutralize-flaky-web-tests branch from 9fc57d1 to f708d15 Compare October 19, 2022 22:16
@inickles-grapl inickles-grapl merged commit 2e5b328 into main Oct 19, 2022
@inickles-grapl inickles-grapl deleted the inickles/neutralize-flaky-web-tests branch October 19, 2022 22:30
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants