-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI][Green-Ray][1] Automated retry of infra-error release tests #34057
Conversation
bfed6cc
to
9af76b8
Compare
fe5bab3
to
43f52c6
Compare
23bc4c7
to
72c0f9b
Compare
12ce837
to
70b9852
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we need most of the refactoring here.
We already expose exit codes that are much more descriptive than here, e.g.:
# Hard infra errors (non-retryable)
CLI_ERROR = 10
CONFIG_ERROR = 11
SETUP_ERROR = 12
CLUSTER_RESOURCE_ERROR = 13
CLUSTER_ENV_BUILD_ERROR = 14
CLUSTER_STARTUP_ERROR = 15
LOCAL_ENV_SETUP_ERROR = 16
REMOTE_ENV_SETUP_ERROR = 17
FETCH_RESULT_ERROR = 18
ANYSCALE_ERROR = 19
# Infra timeouts (retryable)
RAY_WHEELS_TIMEOUT = 30
CLUSTER_ENV_BUILD_TIMEOUT = 31
CLUSTER_STARTUP_TIMEOUT = 32
CLUSTER_WAIT_TIMEOUT = 33
Wouldn't it be enough to set
- exit_status: 30
limit: 3
- exit_status: 31
limit: 3
etc?
I'd like to have clarity on this before merging so requesting changes for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think the retry logic exists in a similar matter in run_release_test.sh
- only that we we retry there in the same job and not in a repeated job.
Would it make sense to adjust that script to return a "retry" or "non-retry" exit code to buildkite? That way we can retain the existing exit codes in the python script and move buildkite-specific logic into the buildkite wrapper.
@krfricke ah nice, I didn't notice retry logic exists in the wrapper as well. I still kind of like a different result.status (currently timeout, error, infra_error, infra_timeout) for jobs that are retried, as they are stored in databricks, for further analysis and tracking purpose though. So I can keep most of the logic to update result.status and result.last_logs (lot of refactoring here is to capture this for infra-error cases), keep the existing return code the same and move buildkite return code/retry to the wrapper. How do you like that? Or is it easier to understand to keep the retry logic in the same place as the computation of result.status. |
Yeah that sounds good to me! Another thing, our dashboards currently use the status string to parse statuses, we should make sure that we keep the existing names for backwards compatibility or adjust the dashboards to be compatible. E.g. |
@krfricke: awesome, I'll make sure to update the dashboard too! |
Good for review now. Re-tested as well: https://buildkite.com/ray-project/release-tests-pr/builds/34922. Thanks for reviewing @krfricke |
pipeline_exception = None | ||
# non critical for some tests. So separate it from the general one. | ||
fetch_result_exception = None | ||
try: | ||
buildkite_group(":spiral_note_pad: Loading test configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the block under try-catch so all errors are handled through handle_exception
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
e8e2672
to
72ea4a1
Compare
Signed-off-by: Cuong Nguyen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG! Thanks for the revision.
Ping me to merge
@@ -173,4 +175,8 @@ if [ -z "${NO_CLONE}" ]; then | |||
rm -rf "${TMPDIR}" || true | |||
fi | |||
|
|||
exit $EXIT_CODE | |||
if [ "$REASON" == "infra error" ] || [ "$REASON" == "infra timeout" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if [ "$REASON" == "infra error" ] || [ "$REASON" == "infra timeout" ]; then | |
if [ "$REASON" == "infra_error" ] || [ "$REASON" == "infra_timeout" ]; then |
underscores missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are without underscore; obtained from the reason() function on top of this file ;)
@@ -94,7 +94,7 @@ def test_repeat(setup): | |||
ExitCode.CLUSTER_WAIT_TIMEOUT, | |||
ExitCode.RAY_WHEELS_TIMEOUT, | |||
) | |||
== ExitCode.RAY_WHEELS_TIMEOUT.value | |||
== 79 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
== 79 | |
== 79 # BUILDKITE_RETRY_CODE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally!
Signed-off-by: Cuong Nguyen <[email protected]>
…34110) In #34057, I made it so far release tests that fail with infra-error will automatically retry once. This PR makes it so that, not only it has to fail with infra-error, it has to run within less than 30 minutes as well. Signed-off-by: Cuong Nguyen <[email protected]>
…project#34057) This PR is a part of my effort to make OSS release test run greener, starting with reducing infra error rates. Other work such as [this from Lonnie](https://docs.google.com/document/d/1hF7h8F19qFWFxH9WVeT8fWwVuNyUyHLTx-7LP3uxD50/edit#heading=h.i0cvl0u8jbfu) fixes systematic issues such as unstable Anyscale staging environment. This PR addresses transient issues with Anyscale that are hard to avoid in a distributed system. On a day Anyscale behaves well, transient issue seem to be around [2-3%](https://b534fd88.us1a.app.preset.io/superset/dashboard/43/?force=false&native_filters_key=MoYaGptJfGwbkF60A7RSzfoRLL_ypDf_JvNFxp2YGQ8Ls4CNgbAWEBh0WcOkOLsS), aka. 4 random failures for a test suite of 200 tests, annoying! Concretely it will: - First, classify an infra test run as a transient infra issue - Instruct buildkite to automatically retry on transient issue - If retry runs out, classify the infra test run as infra issue Some other limitations that will be addressed in followup PRs: - Move infra-failure retry configuration into LaunchDarkly? - Limit auto-retry based on test cost or test runtime Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
…ay-project#34110) In ray-project#34057, I made it so far release tests that fail with infra-error will automatically retry once. This PR makes it so that, not only it has to fail with infra-error, it has to run within less than 30 minutes as well. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
…project#34057) This PR is a part of my effort to make OSS release test run greener, starting with reducing infra error rates. Other work such as [this from Lonnie](https://docs.google.com/document/d/1hF7h8F19qFWFxH9WVeT8fWwVuNyUyHLTx-7LP3uxD50/edit#heading=h.i0cvl0u8jbfu) fixes systematic issues such as unstable Anyscale staging environment. This PR addresses transient issues with Anyscale that are hard to avoid in a distributed system. On a day Anyscale behaves well, transient issue seem to be around [2-3%](https://b534fd88.us1a.app.preset.io/superset/dashboard/43/?force=false&native_filters_key=MoYaGptJfGwbkF60A7RSzfoRLL_ypDf_JvNFxp2YGQ8Ls4CNgbAWEBh0WcOkOLsS), aka. 4 random failures for a test suite of 200 tests, annoying! Concretely it will: - First, classify an infra test run as a transient infra issue - Instruct buildkite to automatically retry on transient issue - If retry runs out, classify the infra test run as infra issue Some other limitations that will be addressed in followup PRs: - Move infra-failure retry configuration into LaunchDarkly? - Limit auto-retry based on test cost or test runtime Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Jack He <[email protected]>
…ay-project#34110) In ray-project#34057, I made it so far release tests that fail with infra-error will automatically retry once. This PR makes it so that, not only it has to fail with infra-error, it has to run within less than 30 minutes as well. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Jack He <[email protected]>
Why are these changes needed?
This might or might not be a controversial PR, so any feedback here is super welcome.
This PR is a part of my effort to make OSS release test run greener, starting with reducing infra error rates. Other work such as this from Lonnie fixes systematic issues such as unstable Anyscale staging environment. This PR addresses transient issues with Anyscale that are hard to avoid in a distributed system. On a day Anyscale behaves well, transient issue seem to be around 2-3%, aka. 4 random failures for a test suite of 200 tests, annoying!
Concretely it will:
Some other limitations that will be addressed in followup PRs:
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.