
Report ThresholdsHaveFailed on Cloud runs #3876

Merged (2 commits) on Aug 22, 2024
Conversation

@joanlopez (Contributor) commented Jul 30, 2024

(Note: it looks like there are probably a couple more scenarios that could also be reported more concretely, but I'd prefer to open a separate pull request for each, so we can discuss them independently.)

What?

It handles the exceptional case when testProgress.RunStatus == cloudapi.RunStatusFinished but testProgress.ResultStatus == cloudapi.ResultStatusFailed, which means that the test run failed due to thresholds being crossed, in order to exit with the specific exit code (ThresholdsHaveFailed = 99).
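In code terms, the change boils down to something like the following sketch (the exact wiring may differ; errext.WithExitCodeIfNone and exitcodes.ThresholdsHaveFailed are assumed to be the same helpers k6 already uses for local threshold failures):

if testProgress.RunStatus == cloudapi.RunStatusFinished &&
    testProgress.ResultStatus == cloudapi.ResultStatusFailed {
    // The run itself completed, but thresholds were crossed, so report
    // the dedicated exit code instead of the generic cloud failure.
    return errext.WithExitCodeIfNone(
        errors.New("thresholds have been crossed"),
        exitcodes.ThresholdsHaveFailed, // exit status 99
    )
}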

Before

> k6 cloud playground/script.js
...
...
...

     test status: Finished

Run    [======================================] Finished
ERRO[0044] The test has failed                          
exit status 97

After

> k6 cloud playground/script.js
...
...
...

     test status: Finished

Run    [======================================] Finished
ERRO[0042] Thresholds have been crossed                 
exit status 99

Why?

Currently, when a test run executed in Grafana Cloud k6 fails due to thresholds being crossed, it is reported as a generic cloud error (CloudTestRunFailed = 97), giving the user no clue about why the test run failed.

In contrast, when the same test run is executed locally, the user receives meaningful feedback and the specific exit code (ThresholdsHaveFailed = 99).

All of that makes the whole user experience worse, imho: it's not only inconsistent, preventing users from reacting to those exit codes, but the current feedback is also so generic (see below) that it gives the user no clue at all about why the test run failed:

Run    [======================================] Finished
ERRO[0044] The test has failed                          
exit status 97
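For reference, the two exit codes involved look roughly like this (a simplified sketch; the names and values come from the description above, while the exact type declaration in k6's errext/exitcodes package is an assumption here):

type ExitCode uint8

const (
    CloudTestRunFailed   ExitCode = 97 // generic "the cloud test run failed"
    ThresholdsHaveFailed ExitCode = 99 // thresholds were crossed
)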

Checklist

  • I have performed a self-review of my code.
  • I have added tests for my changes.
  • I have run the linter locally (make lint) and all checks pass.
  • I have run the tests locally (make tests) and all tests pass.
  • I have commented on my code, particularly in hard-to-understand areas.

Related PR(s)/Issue(s)

@joanlopez joanlopez added this to the v0.54.0 milestone Jul 30, 2024
@joanlopez joanlopez self-assigned this Jul 30, 2024
@joanlopez joanlopez requested a review from a team as a code owner July 30, 2024 07:38
@joanlopez joanlopez requested review from codebien and removed request for a team July 30, 2024 07:38
// Although it isn't self-explanatory just from looking at [ResultStatus] and [RunStatus],
// the scenario where the test run has finished but failed is an exceptional case reserved
// for situations when thresholds have been crossed (failed). So, we report it as such.
if testProgress.RunStatus == cloudapi.RunStatusFinished {
Contributor

I'm not certain about this assumption. During a test run it's also possible that an error happens during execution, or that it gets aborted by limits, like the comment below says, so just being in that run state doesn't guarantee that the test failed because of thresholds.

However, I must admit I don't know what the response from the backend looks like; have you chatted with them?

Contributor Author
The reason will be displayed to describe this comment to others. Learn more.

If I got it correctly, we have RunStatuses specifically for those situations (see https://github.com/grafana/k6/blob/master/cloudapi/run_status.go#L16-L20).

Also, my assumption is that RunStatusAbortedThreshold doesn't apply here, because in this case the test hasn't been aborted (that's what happens with abortOnFail: true); it just finished and failed at the end, because of unsuccessful threshold checks.
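For context, the abort scenarios each get a dedicated status in that file; roughly (the names follow the linked cloudapi/run_status.go, while the declaration style and numeric values here are illustrative assumptions):

const (
    RunStatusAbortedUser        RunStatus = 5
    RunStatusAbortedSystem      RunStatus = 6
    RunStatusAbortedScriptError RunStatus = 7
    RunStatusAbortedThreshold   RunStatus = 8
    RunStatusAbortedLimit       RunStatus = 9
)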

And yes, that's what I got after a conversation with @d14c and @Griatch, so maybe they can confirm it.

Contributor Author

Additionally, I know there are other edge cases that we may want to handle within this piece of code (others among the aforementioned RunStatuses), but as I noted in the PR description, I'd prefer to focus just on this edge case here and leave the other scenarios for upcoming, separate PRs.

@codebien (Contributor) commented Aug 1, 2024

I agree with @olegbespalov that it sounds like a risky assumption. Furthermore, I have the feeling that every time we need to read this code, it will require a lot of cognitive load.

Could we ask the backend to add a dedicated RunStatus for it? Like: RunStatusFinishedWithCrossedThresholds.

@Griatch commented Aug 1, 2024

@joanlopez A few points here:

  • ABORTED_BY_THRESHOLD is a run_status end status for thresholds that are set to abort the test when crossed. In this case the test will never reach run_status FINISHED.
  • run_status FINISHED can still be reached if the threshold was not set up to abort the test. In this case the outcome will only be reflected in the result_status, which is handled completely separately from run_status.
  • result_status False/True will start out as True and will switch to False only on a failed threshold, afaik. So it's, for example, possible to have a test with run_status ABORTED_SYSTEM but result_status True, because the threshold was never crossed.

For the upcoming public API we are introducing a 'success' status that is a little easier to reason about. Inviting @fornfrey onto this to make sure I get it right; but it's basically a combination of run/result status that will fail both on a crossed threshold and on a test that fails to complete properly.
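In other words, the described semantics would be derivable roughly like this (a hypothetical illustration based on the statuses discussed in this thread, not the actual public API):

// 'success' fails both on a crossed threshold (result_status False) and
// on a run that didn't complete properly (run_status != FINISHED).
success := testProgress.RunStatus == cloudapi.RunStatusFinished &&
    testProgress.ResultStatus != cloudapi.ResultStatusFailed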

Contributor Author

So, if I got it right, it looks like my assumptions are confirmed again by @Griatch.

What do you think, guys? @olegbespalov @codebien
Do you want to go like this, or wait until we have the new API with clearer information?

If we prefer to wait, I can close this PR and create an issue with some context, marked as blocked, waiting for the new public API. As you prefer!

Thanks!

Contributor

Thanks for all the input! It all sounds clear and legit, and it seems like we could clean up this TODO:

// TODO: use different exit codes for failed thresholds vs failed test (e.g. aborted by system/limit)

> If we prefer to wait, I can close this PR and create an issue with some context, marked as blocked, waiting for the new public API. As you prefer!

I don't know about the ETA of the new API, but if we can bring value earlier, that sounds better.

Contributor

I guess my question is how likely it is:

  • that the test has finished,
  • that it fails,
  • and that it isn't a threshold failure.

I guess it is fine if we have some false positives, but still.

@joanlopez (Contributor Author) commented Aug 21, 2024

Isn't that what the previous comment from @Griatch explains? Like:

> result_status False/True will start out as True and will switch to False only on a failed threshold, afaik.

I think that, if there's any other error (e.g. aborted by system), it's reflected in the run_status, and not in the result_status, which seems to be tied to thresholds.

Contributor

Ah, I guess I skimmed over that particular line a bit too much.

Again, this seems like strange behaviour, but I guess it is what it is.

@mstoykov (Contributor) left a comment

LGTM in general!

Although the API seems strange, and I have added one more question in the discussion.


@joanlopez (Contributor Author)

> Although the API seems strange

Yeah, I agree, because result_status looks quite generic, while it seems to be completely tied to thresholds.
So it's not trivial to guess what happened based only on the information you get from the API.
