
[CI][Green-Ray][1] Automated retry of infra-error release tests #34057

Merged
merged 36 commits into master from can-ci-errors
Apr 17, 2023

Conversation

@can-anyscale (Collaborator) commented Apr 4, 2023

Why are these changes needed?

This might or might not be a controversial PR, so any feedback here is super welcome.

This PR is part of my effort to make OSS release test runs greener, starting with reducing infra error rates. Other work, such as [this from Lonnie](https://docs.google.com/document/d/1hF7h8F19qFWFxH9WVeT8fWwVuNyUyHLTx-7LP3uxD50/edit#heading=h.i0cvl0u8jbfu), fixes systematic issues such as the unstable Anyscale staging environment. This PR addresses transient issues with Anyscale that are hard to avoid in a distributed system. Even on a day when Anyscale behaves well, transient issues account for around [2-3%](https://b534fd88.us1a.app.preset.io/superset/dashboard/43/?force=false&native_filters_key=MoYaGptJfGwbkF60A7RSzfoRLL_ypDf_JvNFxp2YGQ8Ls4CNgbAWEBh0WcOkOLsS) of runs, i.e. roughly 4 random failures in a test suite of 200 tests. Annoying!

Concretely, this PR will (see the sketch below):

  • First, classify a failed test run as a transient infra issue
  • Instruct Buildkite to automatically retry on transient issues
  • If retries run out, classify the run as an infra issue
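
To make the retry mechanics concrete, here is a minimal, hypothetical sketch in the spirit of release/ray_release/buildkite/step.py; the function and constant names are illustrative, not the actual ray_release API. It builds a Buildkite step that auto-retries only on a dedicated transient-infra exit code (79, the BUILDKITE_RETRY_CODE that comes up later in this review):

    # Hypothetical sketch, not the actual ray_release code.
    BUILDKITE_RETRY_CODE = 79  # exit code reserved for transient infra failures

    def get_step(test_name: str, command: str, max_retries: int = 1) -> dict:
        """Build a Buildkite step dict that retries only on transient infra errors."""
        return {
            "label": test_name,
            "command": command,
            "retry": {
                "automatic": [
                    # Buildkite re-runs the job up to `max_retries` times, but
                    # only for this exit status; ordinary test failures and
                    # hard infra errors keep their descriptive codes and fail.
                    {"exit_status": BUILDKITE_RETRY_CODE, "limit": max_retries},
                ]
            },
        }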

Some other limitations that will be addressed in followup PRs:

  • Move infra-failure retry configuration into LaunchDarkly?
  • Limit auto-retry based on test cost or test runtime


@can-anyscale changed the base branch from master to can-ci-clean-03 April 4, 2023 18:20
@can-anyscale changed the base branch from can-ci-clean-03 to master April 4, 2023 18:20
@can-anyscale changed the title from "Can ci errors" to "[CI][Green-Ray] Automated retry of infra-error release tests" Apr 5, 2023
@can-anyscale changed the title from "[CI][Green-Ray] Automated retry of infra-error release tests" to "[CI][Green-Ray][1] Automated retry of infra-error release tests" Apr 5, 2023
@can-anyscale changed the base branch from master to can-ci-clean-04 April 5, 2023 17:34
@can-anyscale marked this pull request as ready for review April 5, 2023 17:39
@can-anyscale requested a review from a team as a code owner April 5, 2023 17:39
Base automatically changed from can-ci-clean-04 to master April 12, 2023 18:53
@krfricke (Contributor) left a comment

I'm not sure if we need most of the refactoring here.

We already expose exit codes that are much more descriptive than here, e.g.:

    # Hard infra errors (non-retryable)
    CLI_ERROR = 10
    CONFIG_ERROR = 11
    SETUP_ERROR = 12
    CLUSTER_RESOURCE_ERROR = 13
    CLUSTER_ENV_BUILD_ERROR = 14
    CLUSTER_STARTUP_ERROR = 15
    LOCAL_ENV_SETUP_ERROR = 16
    REMOTE_ENV_SETUP_ERROR = 17
    FETCH_RESULT_ERROR = 18
    ANYSCALE_ERROR = 19

    # Infra timeouts (retryable)
    RAY_WHEELS_TIMEOUT = 30
    CLUSTER_ENV_BUILD_TIMEOUT = 31
    CLUSTER_STARTUP_TIMEOUT = 32
    CLUSTER_WAIT_TIMEOUT = 33

Wouldn't it be enough to set

- exit_status: 30
  limit: 3
- exit_status: 31
  limit: 3

etc?

I'd like to have clarity on this before merging, so I'm requesting changes for now.

release/ray_release/buildkite/step.py (outdated thread, resolved)
@krfricke (Contributor) left a comment

Actually, I think the retry logic already exists in a similar manner in run_release_test.sh, only there we retry within the same job rather than in a repeated job.

Would it make sense to adjust that script to return a "retry" or "non-retry" exit code to Buildkite? That way we can retain the existing exit codes in the Python script and move Buildkite-specific logic into the Buildkite wrapper.

@can-anyscale (Collaborator, Author)

@krfricke ah nice, I didn't notice the retry logic exists in the wrapper as well. I still kind of like having a distinct result.status (currently timeout, error, infra_error, infra_timeout) for jobs that are retried, since those statuses are stored in Databricks for further analysis and tracking. So I can keep most of the logic that updates result.status and result.last_logs (a lot of the refactoring here is to capture these for infra-error cases), keep the existing return codes the same, and move the Buildkite return code/retry into the wrapper. How does that sound? Or is it easier to understand if the retry logic lives in the same place as the computation of result.status?
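
To illustrate the split being proposed here, a minimal sketch (names are assumptions, not the real ray_release API): result.status keeps its existing values and is persisted as before, and only the final exit-code mapping decides whether the Buildkite wrapper retries:

    # Illustrative sketch of the proposed split, not the actual implementation.
    BUILDKITE_RETRY_CODE = 79
    RETRYABLE_STATUSES = {"infra_error", "infra_timeout"}

    def buildkite_exit_code(result_status: str, descriptive_code: int) -> int:
        # Collapse only retryable infra failures into the single exit code
        # that the Buildkite step's `retry.automatic` rule matches on;
        # everything else keeps its descriptive exit code.
        if result_status in RETRYABLE_STATUSES:
            return BUILDKITE_RETRY_CODE
        return descriptive_code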

@krfricke (Contributor)

Yeah, that sounds good to me!

One more thing: our dashboards currently parse the status string, so we should make sure to keep the existing status names for backwards compatibility, or adjust the dashboards to match.

E.g. go/cd

@can-anyscale (Collaborator, Author)

@krfricke: awesome, I'll make sure to update the dashboard too!

@can-anyscale marked this pull request as draft April 13, 2023 19:50
@can-anyscale marked this pull request as ready for review April 13, 2023 20:39
@can-anyscale (Collaborator, Author)

Good for review now. Re-tested as well: https://buildkite.com/ray-project/release-tests-pr/builds/34922. Thanks for reviewing @krfricke

    pipeline_exception = None
    # non critical for some tests. So separate it from the general one.
    fetch_result_exception = None
    try:
        buildkite_group(":spiral_note_pad: Loading test configuration")
@can-anyscale (Collaborator, Author)

Move this block under the try/except so all errors are handled through handle_exception.
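
As a self-contained illustration of this pattern (all names and signatures here are assumptions, not the real ray_release API):

    # Sketch of centralized error handling: config loading now sits inside
    # the try block, so its failures are classified like everything else.
    class Result:
        status: str = "success"
        last_logs: str = ""

    def handle_exception(e: Exception) -> tuple:
        # Classify infra timeouts separately so they can be retried.
        if isinstance(e, TimeoutError):
            return "infra_timeout", str(e)
        return "error", str(e)

    def load_test_config() -> dict:
        raise TimeoutError("cluster env build timed out")  # stand-in failure

    def run_pipeline() -> Result:
        result = Result()
        try:
            load_test_config()  # previously ran outside the try block
        except Exception as e:
            result.status, result.last_logs = handle_exception(e)
        return result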

@krfricke (Contributor) left a comment

LG! Thanks for the revision.

Ping me to merge

    @@ -173,4 +175,8 @@ if [ -z "${NO_CLONE}" ]; then
        rm -rf "${TMPDIR}" || true
    fi

    exit $EXIT_CODE
    if [ "$REASON" == "infra error" ] || [ "$REASON" == "infra timeout" ]; then
Contributor:

Suggested change:

    - if [ "$REASON" == "infra error" ] || [ "$REASON" == "infra timeout" ]; then
    + if [ "$REASON" == "infra_error" ] || [ "$REASON" == "infra_timeout" ]; then

underscores missing?

@can-anyscale (Collaborator, Author)

They are without underscores; they come from the reason() function at the top of this file ;)
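
For readers following along: the real reason() is a shell function in run_release_test.sh; rendered in Python purely for illustration, and going by the ExitCode listing earlier in this thread, the mapping is roughly:

    # Illustrative Python rendering of the shell reason() mapping; the
    # ranges follow the ExitCode listing above, and the reason strings use
    # spaces, not underscores.
    def reason(exit_code: int) -> str:
        if exit_code == 0:
            return "success"
        if 10 <= exit_code <= 19:
            return "infra error"    # hard infra errors (non-retryable)
        if 30 <= exit_code <= 33:
            return "infra timeout"  # infra timeouts (retryable)
        return "error"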

    @@ -94,7 +94,7 @@ def test_repeat(setup):
            ExitCode.CLUSTER_WAIT_TIMEOUT,
            ExitCode.RAY_WHEELS_TIMEOUT,
        )
    -   == ExitCode.RAY_WHEELS_TIMEOUT.value
    +   == 79
Contributor:

Suggested change:

    - == 79
    + == 79  # BUILDKITE_RETRY_CODE

@can-anyscale (Collaborator, Author)

totally!

@krfricke merged commit 20aa7b7 into master Apr 17, 2023
@krfricke deleted the can-ci-errors branch April 17, 2023 17:37
krfricke pushed a commit that referenced this pull request Apr 18, 2023
…34110)

In #34057, I made it so that release tests that fail with an infra error automatically retry once. This PR adds a further condition: not only must the test fail with an infra error, it must also have run for less than 30 minutes.

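A minimal sketch of the runtime gate #34110 describes (the 30-minute threshold is from the commit message; constant and function names are assumptions):

    # Illustrative sketch, not the actual ray_release code.
    BUILDKITE_RETRY_CODE = 79
    MAX_RETRYABLE_RUNTIME_S = 30 * 60  # only retry runs shorter than 30 minutes

    def final_exit_code(infra_error: bool, runtime_s: float, code: int) -> int:
        if infra_error and runtime_s < MAX_RETRYABLE_RUNTIME_S:
            return BUILDKITE_RETRY_CODE  # Buildkite will auto-retry this job
        return code  # keep the descriptive exit code; no automatic retry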
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…project#34057)
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…ay-project#34110)
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…project#34057)
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…ay-project#34110)