
[serve] Add exponential backoff when retrying replicas #31436

Merged
10 commits merged from the exponential-backoff branch into ray-project:master on Jan 27, 2023

Conversation

@zcin (Contributor) commented Jan 4, 2023

Signed-off-by: Cindy Zhang [email protected]

Why are these changes needed?

If a deployment is repeatedly failing, perform exponential backoff so that we don't repeatedly try to restart its replicas at a very fast rate.
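For context, the intended behavior is roughly the following. This is a minimal sketch with illustrative names and defaults (the ReplicaRestarter class is hypothetical), not the actual deployment_state.py code:

```python
import time

# Illustrative defaults; the PR makes the factor and the cap configurable.
INITIAL_BACKOFF_S = 1.0
BACKOFF_FACTOR = 2.0
MAX_BACKOFF_S = 64.0


class ReplicaRestarter:
    """Toy model: restart attempts are spaced out by a doubling, capped delay."""

    def __init__(self):
        self.backoff_s = INITIAL_BACKOFF_S
        self.last_retry = 0.0

    def maybe_restart(self, start_replica) -> bool:
        now = time.time()
        if now - self.last_retry < self.backoff_s:
            return False  # still inside the backoff window; skip this attempt
        self.last_retry = now
        # Double the wait for the next attempt, capped at MAX_BACKOFF_S.
        self.backoff_s = min(self.backoff_s * BACKOFF_FACTOR, MAX_BACKOFF_S)
        start_replica()
        return True
```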

Related issue number

Closes #31121

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@zcin zcin changed the title [serve] add exponential backoff [WIP][serve] add exponential backoff Jan 5, 2023
@zcin zcin marked this pull request as ready for review January 5, 2023 22:41
@zcin zcin changed the title [WIP][serve] add exponential backoff [serve] add exponential backoff Jan 10, 2023
@zcin zcin changed the title [serve] add exponential backoff [serve] Add exponential backoff when retrying replicas Jan 12, 2023
# Exponential backoff when retrying a consistently failing deployment
self._last_retry: float = 0.0
self._backoff_time: int = 1
self._max_backoff: int = 64
Contributor:

nit: _max_backoff_time_s, _backoff_time_s

Contributor:

consider making these parametrizable via env var

@zcin (Contributor, Author) replied Jan 24, 2023:

Thanks for the suggestions, applied! I made the backoff factor and the max backoff time env variables.
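For illustration, wiring those two knobs to environment variables could look like the sketch below; the exact variable names and defaults here are assumptions rather than the names in the final diff:

```python
import os

# Assumed env var names/defaults, mirroring the suggestion above.
EXPONENTIAL_BACKOFF_FACTOR = float(os.environ.get("EXPONENTIAL_BACKOFF_FACTOR", "2.0"))
MAX_BACKOFF_TIME_S = int(os.environ.get("MAX_BACKOFF_TIME_S", "64"))

# Per the naming nit above, the per-deployment state would then start as:
#     self._backoff_time_s: int = 1
#     self._max_backoff_time_s: int = MAX_BACKOFF_TIME_S
```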

python/ray/serve/_private/deployment_state.py (thread resolved)
@architkulkarni architkulkarni self-assigned this Jan 24, 2023
@edoakes (Contributor) left a comment:

Looks good.

@zcin what is the behavior under controller failure?

# Exponential backoff
failed_to_start_threshold = min(
    MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT,
    self._target_state.num_replicas * 3,
)
Contributor:

What's the reason for choosing self._target_state.num_replicas * 3?

Contributor:

Ah is it to compare against the total number of replica restarts across the deployment? (So on average each replica has failed 3 times?)

@zcin (Contributor, Author):

I used the same threshold that's used for setting the deployment unhealthy (code pointer). Basically, we start exponential backoff once replicas have each failed 3 times on average and the deployment is determined unhealthy.

@zcin (Contributor, Author), quoting the question above:

Ah is it to compare against the total number of replica restarts across the deployment? (So on average each replica has failed 3 times?)

Yup, I believe so.
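Putting the two quoted snippets together, the gating logic amounts to roughly the following self-contained sketch; should_attempt_restart and its parameters are hypothetical names, and the real logic lives in the deployment state control loop:

```python
import time


def should_attempt_restart(
    retry_counter: int,
    num_replicas: int,
    max_constructor_retries: int,
    last_retry: float,
    backoff_time_s: float,
) -> bool:
    """Sketch: once deployment-wide constructor failures reach roughly 3 per
    replica (capped by MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT), only retry
    after the current backoff window has elapsed."""
    failed_to_start_threshold = min(max_constructor_retries, num_replicas * 3)
    if retry_counter < failed_to_start_threshold:
        return True  # not consistently failing yet; retry immediately
    return time.time() - last_retry >= backoff_time_s
```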

@zcin (Contributor, Author) replied Jan 24, 2023, quoting the question above:

@zcin what is the behavior under controller failure?

@edoakes Not sure if there's a way to test, but I believe all related state is reset. From what I read in the code, _replica_constructor_retry_counter itself is reset, and backoff_time_s is reset as well (to 1 sec); the only state recovered for each deployment is the target state.

@zcin (Contributor, Author) commented Jan 27, 2023:

@edoakes @architkulkarni Could you take another look? Let me know if there's anything else I should address.

@architkulkarni (Contributor) left a comment:

Looks good!

@architkulkarni (Contributor) commented:

Behavior under controller reset sounds reasonable. Ed is out until Tuesday so I'll merge this.

The failing doc test is unrelated (rllib)

@architkulkarni architkulkarni merged commit 3f1a880 into ray-project:master Jan 27, 2023
@zcin zcin deleted the exponential-backoff branch January 31, 2023 17:49
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…1436)

If deployment is repeatedly failing, perform exponential backoff so as to not repeatedly try to restart the replica at a very fast rate.

Related issue number
Closes ray-project#31121

Signed-off-by: Edward Oakes <[email protected]>
Successfully merging this pull request may close these issues.

[serve] Infinite restarting of replica when deployment constructor consistently fails (#31121)
4 participants