
[CI][Green-Ray][1] Automated retry of infra-error release tests #34057

Merged
merged 36 commits into master from can-ci-errors
Apr 17, 2023

Conversation

@can-anyscale (Collaborator) commented Apr 4, 2023

Why are these changes needed?

This might or might not be a controversial PR, so any feedback here is super welcome.

This PR is part of my effort to make OSS release test runs greener, starting with reducing infra error rates. Other work, such as [this from Lonnie](https://docs.google.com/document/d/1hF7h8F19qFWFxH9WVeT8fWwVuNyUyHLTx-7LP3uxD50/edit#heading=h.i0cvl0u8jbfu), fixes systematic issues such as the unstable Anyscale staging environment. This PR addresses transient issues with Anyscale that are hard to avoid in a distributed system. Even on a day when Anyscale behaves well, transient issues account for around [2-3%](https://b534fd88.us1a.app.preset.io/superset/dashboard/43/?force=false&native_filters_key=MoYaGptJfGwbkF60A7RSzfoRLL_ypDf_JvNFxp2YGQ8Ls4CNgbAWEBh0WcOkOLsS) of runs, i.e. roughly 4 random failures in a test suite of 200 tests. Annoying!

Concretely, this PR will (see the sketch below):

  • First, classify a failed test run as a transient infra issue
  • Instruct Buildkite to automatically retry on transient issues
  • If retries run out, classify the run as an infra issue
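
To make the retry mechanics concrete, here is a minimal, hypothetical sketch in the spirit of release/ray_release/buildkite/step.py; the function and constant names are illustrative, not the actual ray_release API. It builds a Buildkite step that auto-retries only on a dedicated transient-infra exit code (79, the BUILDKITE_RETRY_CODE that comes up later in this review):

    # Hypothetical sketch, not the actual ray_release code.
    BUILDKITE_RETRY_CODE = 79  # exit code reserved for transient infra failures

    def get_step(test_name: str, command: str, max_retries: int = 1) -> dict:
        """Build a Buildkite step dict that retries only on transient infra errors."""
        return {
            "label": test_name,
            "command": command,
            "retry": {
                "automatic": [
                    # Buildkite re-runs the job up to `max_retries` times, but
                    # only for this exit status; ordinary test failures and
                    # hard infra errors keep their descriptive codes and fail.
                    {"exit_status": BUILDKITE_RETRY_CODE, "limit": max_retries},
                ]
            },
        }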

Some other limitations that will be addressed in followup PRs:

  • Move infra-failure retry configuration into LaunchDarkly?
  • Limit auto-retry based on test cost or test runtime


@can-anyscale changed the base branch from master to can-ci-clean-03 April 4, 2023 18:20
@can-anyscale changed the base branch from can-ci-clean-03 to master April 4, 2023 18:20
@can-anyscale changed the title from "Can ci errors" to "[CI][Green-Ray] Automated retry of infra-error release tests" Apr 5, 2023
@can-anyscale changed the title from "[CI][Green-Ray] Automated retry of infra-error release tests" to "[CI][Green-Ray][1] Automated retry of infra-error release tests" Apr 5, 2023
@can-anyscale changed the base branch from master to can-ci-clean-04 April 5, 2023 17:34
@can-anyscale marked this pull request as ready for review April 5, 2023 17:39
@can-anyscale requested a review from a team as a code owner April 5, 2023 17:39
Base automatically changed from can-ci-clean-04 to master April 12, 2023 18:53
@krfricke (Contributor) left a comment

I'm not sure if we need most of the refactoring here.

We already expose exit codes that are much more descriptive than here, e.g.:

    # Hard infra errors (non-retryable)
    CLI_ERROR = 10
    CONFIG_ERROR = 11
    SETUP_ERROR = 12
    CLUSTER_RESOURCE_ERROR = 13
    CLUSTER_ENV_BUILD_ERROR = 14
    CLUSTER_STARTUP_ERROR = 15
    LOCAL_ENV_SETUP_ERROR = 16
    REMOTE_ENV_SETUP_ERROR = 17
    FETCH_RESULT_ERROR = 18
    ANYSCALE_ERROR = 19

    # Infra timeouts (retryable)
    RAY_WHEELS_TIMEOUT = 30
    CLUSTER_ENV_BUILD_TIMEOUT = 31
    CLUSTER_STARTUP_TIMEOUT = 32
    CLUSTER_WAIT_TIMEOUT = 33

Wouldn't it be enough to set

- exit_status: 30
  limit: 3
- exit_status: 31
  limit: 3

etc?

I'd like to have clarity on this before merging, so I'm requesting changes for now.

release/ray_release/buildkite/step.py (outdated thread, resolved)
@krfricke (Contributor) left a comment

Actually, I think the retry logic already exists in a similar manner in run_release_test.sh, only there we retry within the same job rather than in a repeated job.

Would it make sense to adjust that script to return a "retry" or "non-retry" exit code to Buildkite? That way we can retain the existing exit codes in the Python script and move Buildkite-specific logic into the Buildkite wrapper.

@can-anyscale (Collaborator, Author)

@krfricke ah nice, I didn't notice the retry logic exists in the wrapper as well. I still kind of like having a distinct result.status (currently timeout, error, infra_error, infra_timeout) for jobs that are retried, since those statuses are stored in Databricks for further analysis and tracking. So I can keep most of the logic that updates result.status and result.last_logs (a lot of the refactoring here is to capture these for infra-error cases), keep the existing return codes the same, and move the Buildkite return code/retry into the wrapper. How does that sound? Or is it easier to understand if the retry logic lives in the same place as the computation of result.status?
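
To illustrate the split being proposed here, a minimal sketch (names are assumptions, not the real ray_release API): result.status keeps its existing values and is persisted as before, and only the final exit-code mapping decides whether the Buildkite wrapper retries:

    # Illustrative sketch of the proposed split, not the actual implementation.
    BUILDKITE_RETRY_CODE = 79
    RETRYABLE_STATUSES = {"infra_error", "infra_timeout"}

    def buildkite_exit_code(result_status: str, descriptive_code: int) -> int:
        # Collapse only retryable infra failures into the single exit code
        # that the Buildkite step's `retry.automatic` rule matches on;
        # everything else keeps its descriptive exit code.
        if result_status in RETRYABLE_STATUSES:
            return BUILDKITE_RETRY_CODE
        return descriptive_code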

@krfricke (Contributor)

Yeah, that sounds good to me!

One more thing: our dashboards currently parse the status string, so we should make sure to keep the existing status names for backwards compatibility, or adjust the dashboards to match.

E.g. go/cd

@can-anyscale (Collaborator, Author)

@krfricke: awesome, I'll make sure to update the dashboard too!

@can-anyscale marked this pull request as draft April 13, 2023 19:50
@can-anyscale marked this pull request as ready for review April 13, 2023 20:39
@can-anyscale (Collaborator, Author)

Good for review now. Re-tested as well: https://buildkite.com/ray-project/release-tests-pr/builds/34922. Thanks for reviewing @krfricke

    pipeline_exception = None
    # non critical for some tests. So separate it from the general one.
    fetch_result_exception = None
    try:
        buildkite_group(":spiral_note_pad: Loading test configuration")
@can-anyscale (Collaborator, Author)

Move this block under the try/except so all errors are handled through handle_exception.
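
As a self-contained illustration of this pattern (all names and signatures here are assumptions, not the real ray_release API):

    # Sketch of centralized error handling: config loading now sits inside
    # the try block, so its failures are classified like everything else.
    class Result:
        status: str = "success"
        last_logs: str = ""

    def handle_exception(e: Exception) -> tuple:
        # Classify infra timeouts separately so they can be retried.
        if isinstance(e, TimeoutError):
            return "infra_timeout", str(e)
        return "error", str(e)

    def load_test_config() -> dict:
        raise TimeoutError("cluster env build timed out")  # stand-in failure

    def run_pipeline() -> Result:
        result = Result()
        try:
            load_test_config()  # previously ran outside the try block
        except Exception as e:
            result.status, result.last_logs = handle_exception(e)
        return result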

@krfricke (Contributor) left a comment

LG! Thanks for the revision.

Ping me to merge

    @@ -173,4 +175,8 @@ if [ -z "${NO_CLONE}" ]; then
        rm -rf "${TMPDIR}" || true
    fi

    exit $EXIT_CODE
    if [ "$REASON" == "infra error" ] || [ "$REASON" == "infra timeout" ]; then
Contributor:

Suggested change:

    - if [ "$REASON" == "infra error" ] || [ "$REASON" == "infra timeout" ]; then
    + if [ "$REASON" == "infra_error" ] || [ "$REASON" == "infra_timeout" ]; then

underscores missing?

@can-anyscale (Collaborator, Author)

They are without underscores; they come from the reason() function at the top of this file ;)
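
For readers following along: the real reason() is a shell function in run_release_test.sh; rendered in Python purely for illustration, and going by the ExitCode listing earlier in this thread, the mapping is roughly:

    # Illustrative Python rendering of the shell reason() mapping; the
    # ranges follow the ExitCode listing above, and the reason strings use
    # spaces, not underscores.
    def reason(exit_code: int) -> str:
        if exit_code == 0:
            return "success"
        if 10 <= exit_code <= 19:
            return "infra error"    # hard infra errors (non-retryable)
        if 30 <= exit_code <= 33:
            return "infra timeout"  # infra timeouts (retryable)
        return "error"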

    @@ -94,7 +94,7 @@ def test_repeat(setup):
            ExitCode.CLUSTER_WAIT_TIMEOUT,
            ExitCode.RAY_WHEELS_TIMEOUT,
        )
    -   == ExitCode.RAY_WHEELS_TIMEOUT.value
    +   == 79
Contributor:

Suggested change:

    - == 79
    + == 79  # BUILDKITE_RETRY_CODE

@can-anyscale (Collaborator, Author)

totally!

@krfricke merged commit 20aa7b7 into master Apr 17, 2023
@krfricke deleted the can-ci-errors branch April 17, 2023 17:37
krfricke pushed a commit that referenced this pull request Apr 18, 2023
…34110)

In #34057, I made it so that release tests that fail with an infra error automatically retry once. This PR adds a further condition: not only must the test fail with an infra error, it must also have run for less than 30 minutes.

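A minimal sketch of the runtime gate #34110 describes (the 30-minute threshold is from the commit message; constant and function names are assumptions):

    # Illustrative sketch, not the actual ray_release code.
    BUILDKITE_RETRY_CODE = 79
    MAX_RETRYABLE_RUNTIME_S = 30 * 60  # only retry runs shorter than 30 minutes

    def final_exit_code(infra_error: bool, runtime_s: float, code: int) -> int:
        if infra_error and runtime_s < MAX_RETRYABLE_RUNTIME_S:
            return BUILDKITE_RETRY_CODE  # Buildkite will auto-retry this job
        return code  # keep the descriptive exit code; no automatic retry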
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…project#34057)
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…ay-project#34110)
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…project#34057)
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…ay-project#34110)