
[serve] Add exponential backoff when retrying replicas #31436

Merged
10 commits merged from the exponential-backoff branch into ray-project:master on Jan 27, 2023

Conversation

@zcin (Contributor) commented Jan 4, 2023

Signed-off-by: Cindy Zhang [email protected]

Why are these changes needed?

If a deployment is repeatedly failing, perform exponential backoff so that we don't repeatedly try to restart its replicas at a very fast rate.
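For context, the intended behavior is roughly the following. This is a minimal sketch with illustrative names and defaults (the ReplicaRestarter class is hypothetical), not the actual deployment_state.py code:

```python
import time

# Illustrative defaults; the PR makes the factor and the cap configurable.
INITIAL_BACKOFF_S = 1.0
BACKOFF_FACTOR = 2.0
MAX_BACKOFF_S = 64.0


class ReplicaRestarter:
    """Toy model: restart attempts are spaced out by a doubling, capped delay."""

    def __init__(self):
        self.backoff_s = INITIAL_BACKOFF_S
        self.last_retry = 0.0

    def maybe_restart(self, start_replica) -> bool:
        now = time.time()
        if now - self.last_retry < self.backoff_s:
            return False  # still inside the backoff window; skip this attempt
        self.last_retry = now
        # Double the wait for the next attempt, capped at MAX_BACKOFF_S.
        self.backoff_s = min(self.backoff_s * BACKOFF_FACTOR, MAX_BACKOFF_S)
        start_replica()
        return True
```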

Related issue number

Closes #31121

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@zcin zcin changed the title [serve] add exponential backoff [WIP][serve] add exponential backoff Jan 5, 2023
@zcin zcin marked this pull request as ready for review January 5, 2023 22:41
@zcin zcin changed the title [WIP][serve] add exponential backoff [serve] add exponential backoff Jan 10, 2023
@zcin zcin changed the title [serve] add exponential backoff [serve] Add exponential backoff when retrying replicas Jan 12, 2023
# Exponential backoff when retrying a consistently failing deployment
self._last_retry: float = 0.0
self._backoff_time: int = 1
self._max_backoff: int = 64
Contributor:

nit: _max_backoff_time_s, _backoff_time_s

Contributor:

consider making these parametrizable via env var

@zcin (Contributor, Author) replied Jan 24, 2023:

Thanks for the suggestions, applied! I made the backoff factor and the max backoff time env variables.
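For illustration, wiring those two knobs to environment variables could look like the sketch below; the exact variable names and defaults here are assumptions rather than the names in the final diff:

```python
import os

# Assumed env var names/defaults, mirroring the suggestion above.
EXPONENTIAL_BACKOFF_FACTOR = float(os.environ.get("EXPONENTIAL_BACKOFF_FACTOR", "2.0"))
MAX_BACKOFF_TIME_S = int(os.environ.get("MAX_BACKOFF_TIME_S", "64"))

# Per the naming nit above, the per-deployment state would then start as:
#     self._backoff_time_s: int = 1
#     self._max_backoff_time_s: int = MAX_BACKOFF_TIME_S
```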

python/ray/serve/_private/deployment_state.py (thread resolved)
@architkulkarni architkulkarni self-assigned this Jan 24, 2023
@edoakes (Contributor) left a comment:

Looks good.

@zcin what is the behavior under controller failure?

# Exponential backoff
failed_to_start_threshold = min(
    MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT,
    self._target_state.num_replicas * 3,
)
Contributor:

What's the reason for choosing self._target_state.num_replicas * 3?

Contributor:

Ah is it to compare against the total number of replica restarts across the deployment? (So on average each replica has failed 3 times?)

@zcin (Contributor, Author):

I used the same threshold that's used for setting the deployment unhealthy (code pointer). Basically, we start exponential backoff once replicas have each failed 3 times on average and the deployment is determined unhealthy.

@zcin (Contributor, Author), quoting the question above:

Ah is it to compare against the total number of replica restarts across the deployment? (So on average each replica has failed 3 times?)

Yup, I believe so.
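Putting the two quoted snippets together, the gating logic amounts to roughly the following self-contained sketch; should_attempt_restart and its parameters are hypothetical names, and the real logic lives in the deployment state control loop:

```python
import time


def should_attempt_restart(
    retry_counter: int,
    num_replicas: int,
    max_constructor_retries: int,
    last_retry: float,
    backoff_time_s: float,
) -> bool:
    """Sketch: once deployment-wide constructor failures reach roughly 3 per
    replica (capped by MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT), only retry
    after the current backoff window has elapsed."""
    failed_to_start_threshold = min(max_constructor_retries, num_replicas * 3)
    if retry_counter < failed_to_start_threshold:
        return True  # not consistently failing yet; retry immediately
    return time.time() - last_retry >= backoff_time_s
```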

@zcin (Contributor, Author) replied Jan 24, 2023, quoting the question above:

@zcin what is the behavior under controller failure?

@edoakes Not sure if there's a way to test, but I believe all related state is reset. From what I read in the code, _replica_constructor_retry_counter itself is reset, and backoff_time_s is reset as well (to 1 sec); the only state recovered for each deployment is the target state.

@zcin (Contributor, Author) commented Jan 27, 2023:

@edoakes @architkulkarni Could you take another look? Let me know if there's anything else I should address.

@architkulkarni (Contributor) left a comment:

Looks good!

@architkulkarni (Contributor) commented:

Behavior under controller reset sounds reasonable. Ed is out until Tuesday so I'll merge this.

The failing doc test is unrelated (rllib)

@architkulkarni architkulkarni merged commit 3f1a880 into ray-project:master Jan 27, 2023
@zcin zcin deleted the exponential-backoff branch January 31, 2023 17:49
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…1436)

If deployment is repeatedly failing, perform exponential backoff so as to not repeatedly try to restart the replica at a very fast rate.

Related issue number
Closes ray-project#31121

Signed-off-by: Edward Oakes <[email protected]>
Successfully merging this pull request may close these issues.

[serve] Infinite restarting of replica when deployment constructor consistently fails (#31121)
4 participants