[serve] Add exponential backoff when retrying replicas #31436
Conversation
Signed-off-by: Cindy Zhang <[email protected]>
# Exponential backoff when retrying a consistently failing deployment
self._last_retry: float = 0.0
self._backoff_time: int = 1
self._max_backoff: int = 64
nit: _max_backoff_time_s, _backoff_time_s
consider making these parametrizable via env var
Thanks for the suggestions, applied! I made the backoff factor and the max backoff time env variables.
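As a rough illustration of the change being discussed, a minimal sketch of env-var-configurable backoff bookkeeping might look like the following. The env var names (SERVE_BACKOFF_FACTOR, SERVE_MAX_BACKOFF_TIME_S), defaults, and class name here are illustrative assumptions, not necessarily what was merged:

```python
import os
import time

# Illustrative env vars; the names and defaults actually merged may differ.
BACKOFF_FACTOR = float(os.environ.get("SERVE_BACKOFF_FACTOR", "2.0"))
MAX_BACKOFF_TIME_S = float(os.environ.get("SERVE_MAX_BACKOFF_TIME_S", "64.0"))


class FailingDeploymentBackoff:
    """Sketch of per-deployment exponential backoff state."""

    def __init__(self) -> None:
        self._last_retry: float = 0.0      # timestamp of the last retry attempt
        self._backoff_time_s: float = 1.0  # current wait between retries

    def can_retry(self) -> bool:
        # Only allow starting a new replica once the backoff window has elapsed.
        return time.time() - self._last_retry >= self._backoff_time_s

    def record_retry(self) -> None:
        # Record the attempt and grow the next wait, capped at the max.
        self._last_retry = time.time()
        self._backoff_time_s = min(
            self._backoff_time_s * BACKOFF_FACTOR, MAX_BACKOFF_TIME_S
        )

    def reset(self) -> None:
        # Reset once the deployment recovers; on controller restart this
        # in-memory state is simply re-initialized.
        self._last_retry = 0.0
        self._backoff_time_s = 1.0
```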
Looks good.
@zcin what is the behavior under controller failure?
# Exponential backoff
failed_to_start_threshold = min(
    MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT,
    self._target_state.num_replicas * 3,
)
What's the reason for choosing self._target_state.num_replicas * 3?
Ah, is it to compare against the total number of replica restarts across the deployment? (So on average each replica has failed 3 times?)
I used the same threshold that's used for marking the deployment unhealthy: code pointer. Basically, we start exponential backoff after a replica fails 3 times and the deployment is determined unhealthy.
Ah, is it to compare against the total number of replica restarts across the deployment? (So on average each replica has failed 3 times?)
Yup, I believe so.
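To make the threshold discussion concrete, here's an illustrative sketch of how total constructor failures across the deployment could gate backoff. The constant's value below is a placeholder, not the real one, and the helper function is hypothetical:

```python
MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT = 20  # placeholder value for illustration


def should_backoff(total_replica_failures: int, target_num_replicas: int) -> bool:
    # Same threshold used to mark the deployment unhealthy: roughly
    # "each replica has failed ~3 times on average" across the deployment.
    failed_to_start_threshold = min(
        MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT,
        target_num_replicas * 3,
    )
    return total_replica_failures >= failed_to_start_threshold
```

For example, with 4 target replicas the threshold would be min(20, 12) = 12 total constructor failures before backoff kicks in.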
@edoakes Not sure if there's a way to test, but from what I read in the code, I believe all related state is reset.
@edoakes @architkulkarni Could you take another look? lmk if there's anything else I should address.
Looks good!
Behavior under controller reset sounds reasonable. Ed is out until Tuesday so I'll merge this. The failing doc test is unrelated (rllib).
Signed-off-by: Cindy Zhang [email protected]
Why are these changes needed?
If a deployment is repeatedly failing, perform exponential backoff so that we don't repeatedly try to restart its replicas at a very fast rate.
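Concretely, with a backoff factor of 2 and a 64 s cap (values assumed here purely for illustration), the wait between consecutive retries of a failing deployment grows like this:

```python
# Illustration of the retry schedule produced by exponential backoff.
wait_s, schedule = 1.0, []
for _ in range(8):
    schedule.append(wait_s)
    wait_s = min(wait_s * 2, 64.0)
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 64.0]
```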
Related issue number
Closes #31121
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.