[serve] Add initial health check before marking a replica as RUNNING #31189

zcin · 2022-12-19T18:53:30Z

Signed-off-by: Cindy Zhang [email protected]

Why are these changes needed?

A deployment is marked HEALTHY as soon as it reaches the target number of RUNNING replicas. This is misleading when replicas repeatedly fail health checks: each replica is marked RUNNING after successful initialization, and the deployment status does not change until 30 seconds later, when the replica fails 3 consecutive health checks and is restarted (and then the cycle repeats).

This change ensures that a replica is not "healthy by default"; instead, the replica must pass one health check immediately after startup. If it passes, the replica is marked RUNNING; if not, the failure is treated like a startup failure, and the replica is stopped and restarted immediately.
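A minimal sketch of the gating described above, for illustration only. The names `SketchReplicaState` and `resolve_startup` are hypothetical and are not taken from `deployment_state.py`; the actual controller logic is asynchronous and more involved.

```python
from enum import Enum
from typing import Callable


class SketchReplicaState(Enum):
    """Hypothetical stand-in for the replica states discussed in this PR."""

    STARTING = 1
    RUNNING = 2
    STOPPING = 3


def resolve_startup(
    initialize: Callable[[], None], check_health: Callable[[], None]
) -> SketchReplicaState:
    """Return RUNNING only if initialization and the first health check both succeed."""
    try:
        initialize()
        # The initial health check gates the transition to RUNNING.
        check_health()
    except Exception:
        # A failed initial health check is handled like a startup failure:
        # the replica is stopped (and would then be restarted).
        return SketchReplicaState.STOPPING
    return SketchReplicaState.RUNNING
```

With this shape, a replica whose health check always fails never reaches RUNNING, so the deployment never counts it toward the target number of replicas.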

Related issue number

"Closes #26204"

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(


zcin · 2022-12-20T03:15:42Z

Windows tests unrelated.


edoakes

I wonder if it's possible to totally unify this with the initialization check -- can we just make the health check call first check that the user class is initialized and use the same method for both? Basically in replica.py:

async def check_health_wrapper(self):
    # Block until the user class constructor has finished.
    await self._wait_for_constructor()
    # ... do health check as we do now ...
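To make the suggestion concrete, here is a self-contained sketch using only `asyncio`. `ReplicaSketch`, `run_constructor`, and `demo` are hypothetical names for illustration; only `check_health_wrapper` and `_wait_for_constructor` come from the comment above, and their bodies here are assumptions rather than the code in `replica.py`.

```python
import asyncio


class ReplicaSketch:
    """Hypothetical replica wrapper: health checks wait for construction."""

    def __init__(self, user_check_health):
        self._user_check_health = user_check_health
        self._constructed = asyncio.Event()

    async def run_constructor(self):
        # Stand-in for running the user class constructor.
        await asyncio.sleep(0)
        self._constructed.set()

    async def _wait_for_constructor(self):
        await self._constructed.wait()

    async def check_health_wrapper(self):
        # A single entry point covers both "initialized" and "healthy".
        await self._wait_for_constructor()
        await self._user_check_health()
        return "ok"


async def demo():
    calls = []

    async def user_check():
        calls.append("health")

    replica = ReplicaSketch(user_check)
    pending = asyncio.ensure_future(replica.check_health_wrapper())
    await asyncio.sleep(0)  # let the wrapper start and block on the constructor
    ran_before_init = bool(calls)
    await replica.run_constructor()
    result = await pending
    return ran_before_init, calls, result
```

In this sketch the caller needs only one call to learn that the replica is both constructed and passing its health check, which is the unification being proposed.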

edoakes · 2022-12-31T21:08:36Z

python/ray/serve/_private/deployment_state.py

@@ -69,8 +69,9 @@ class ReplicaState(Enum):
 class ReplicaStartupStatus(Enum):
    PENDING_ALLOCATION = 1
    PENDING_INITIALIZATION = 2
-    SUCCEEDED = 3
-    FAILED = 4
+    PENDING_INITIAL_HEALTH_CHECK = 3


What's the reason to separate this from PENDING_INITIALIZATION? If there is no need for the distinction, let's err on the side of having fewer states (easier to reason about).

sihanwang41 · 2023-01-03T02:18:13Z

python/ray/serve/_private/deployment_state.py

-            except Exception:
-                logger.exception(f"Exception in deployment '{self._deployment_name}'")
-                return ReplicaStartupStatus.FAILED, None
+


Please update the check_ready method description.

sihanwang41 · 2023-01-03T02:19:39Z

python/ray/serve/tests/test_healthcheck.py

@@ -202,6 +203,27 @@ def check_fails_3_times():
    check_fails_3_times()


+def test_health_check_failure_makes_deployment_unhealthy(serve_instance):


Add a test: the deployment goes from HEALTHY to UNHEALTHY when a replica fails the health check.
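The transition the requested test would exercise can be sketched without a Ray cluster. `SketchStatus` and `HealthTracker` below are hypothetical, and the threshold of 3 consecutive failures is taken from the PR description; this is a simplified model of one replica, not the Serve controller's implementation.

```python
from enum import Enum


class SketchStatus(Enum):
    """Hypothetical status values for this sketch only."""

    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"


class HealthTracker:
    """Counts consecutive health check failures for one replica."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, passed: bool) -> SketchStatus:
        # A passing check resets the streak; a failing one extends it.
        if passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return SketchStatus.UNHEALTHY
        return SketchStatus.HEALTHY
```

A test along these lines would start from a passing check, flip a toggle so that checks fail, and assert that the status changes only once the threshold is reached.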


edoakes · 2023-01-05T18:43:34Z

python/ray/serve/_private/replica.py

+        async def is_ready(
            self, user_config: Optional[Any] = None, _after: Optional[Any] = None
+        ):
+            await self._initialize_replica()
+
+            if user_config is not None:
+                await self.reconfigure(user_config)
+
+            # A new replica should not be considered healthy until it passes an
+            # initial health check. If an initial health check fails, consider
+            # it an initialization failure.
+            await self.check_health()
+            return self.get_metadata()
+
+        async def reconfigure(
+            self, user_config: Optional[Any] = None
        ) -> Tuple[DeploymentConfig, DeploymentVersion]:
            # Unused `_after` argument is for scheduling: passing an ObjectRef
            # allows delaying reconfiguration until after this call has returned.
-            if self.replica is None:
-                await self._initialize_replica()
            if user_config is not None:
                await self.replica.reconfigure(user_config)


this is a nice simplification :)

nit: we seem to be mixing the terminology of "ready" and "initialized" -- let's pick one and standardize on it both here and in the controller?

Thanks! I picked "initialized" since it seems "ready" is used more broadly in deployment_state e.g. if the user config is updated.

edoakes · 2023-01-05T18:45:42Z

python/ray/serve/tests/test_healthcheck.py

+    serve.run(WillBeUnhealthy.bind(toggle))
+
+    # Check that deployment is healthy initially
+    assert check_status("HEALTHY")


nit: use the enum instead of the string directly
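A short sketch of what the nit is asking for. The `DeploymentStatus` class below is a stand-in defined only for this example (its members are an assumption modeled on the statuses named in this thread), and this `check_status` takes the current status as an argument, unlike the test helper in the diff.

```python
from enum import Enum


class DeploymentStatus(str, Enum):
    """Stand-in enum for illustration; not imported from Ray."""

    UPDATING = "UPDATING"
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"


def check_status(current: DeploymentStatus, expected: DeploymentStatus) -> bool:
    # Comparing enum members avoids silent mismatches from mistyped strings.
    return current == expected
```

With the enum, a misspelling such as `DeploymentStatus.HELTHY` raises `AttributeError` immediately, whereas a mistyped string literal would simply never match.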


zcin · 2023-01-09T18:14:54Z

@edoakes @sihanwang41 Addressed comments, PTAL!

scv119 · 2023-01-09T23:20:03Z

@zcin this breaks bk://:java and bk://:mac: :apple: Ray C++ and Java

edoakes · 2023-01-09T23:34:47Z

That's on me, wasn't careful enough with merging. @zcin could you make a revert PR?


zcin added 2 commits December 19, 2022 10:47

add initial health check

669043a

Signed-off-by: Cindy Zhang <[email protected]>

add test

83d18f5

Signed-off-by: Cindy Zhang <[email protected]>

zcin marked this pull request as ready for review December 20, 2022 03:14

zcin requested a review from sihanwang41 December 20, 2022 03:15

zcin requested review from shrekris-anyscale, simon-mo and edoakes December 20, 2022 19:15

Merge branch 'master' into check-health

2ca0e31

Signed-off-by: Cindy Zhang <[email protected]>

edoakes reviewed Dec 31, 2022


edoakes added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 31, 2022

sihanwang41 reviewed Jan 3, 2023


zcin added 7 commits January 3, 2023 08:19

Merge branch 'master' into check-health

797f15f

Signed-off-by: Cindy Zhang <[email protected]>

unify health check with initialization

b9fd82f

Signed-off-by: Cindy Zhang <[email protected]>

Merge branch 'master' into check-health

7bb151e

Signed-off-by: Cindy Zhang <[email protected]>

fix

3e1d8d6

Signed-off-by: Cindy Zhang <[email protected]>

add test + update comments

a8c8d46

Signed-off-by: Cindy Zhang <[email protected]>

small changes

c25e024

Signed-off-by: Cindy Zhang <[email protected]>

small changes

31ac90b

Signed-off-by: Cindy Zhang <[email protected]>

zcin added serve Ray Serve Related Issue and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jan 4, 2023

zcin requested review from edoakes and sihanwang41 January 4, 2023 18:11

edoakes approved these changes Jan 5, 2023


zcin added 3 commits January 5, 2023 13:31

apply changes

00b19ce

Signed-off-by: Cindy Zhang <[email protected]>

Merge branch 'master' into check-health

03181f1

Signed-off-by: Cindy Zhang <[email protected]>

improve test comment

c46e7e9

Signed-off-by: Cindy Zhang <[email protected]>

edoakes merged commit 6e5bb24 into ray-project:master Jan 9, 2023

zcin mentioned this pull request Jan 9, 2023

Revert "[serve] Add initial health check before marking a replica as RUNNING" #31548

Merged

zcin added a commit that referenced this pull request Jan 9, 2023

Revert "[serve] Add initial health check before marking a replica as …

a622f25

…RUNNING (#31189)" This reverts commit 6e5bb24. Signed-off-by: Cindy Zhang <[email protected]>

edoakes pushed a commit that referenced this pull request Jan 9, 2023

Revert "[serve] Add initial health check before marking a replica as …

15676dd

…RUNNING (#31189)" (#31548) This reverts commit 6e5bb24. Signed-off-by: Cindy Zhang <[email protected]>

zcin mentioned this pull request Jan 10, 2023

Revert "Revert "[serve] Add initial health check before marking a rep… #31554

Merged


zcin added a commit that referenced this pull request Jan 10, 2023

Revert "Revert "[serve] Add initial health check before marking a rep…

f84577c

…lica as RUNNING (#31189)" (#31548)" This reverts commit 15676dd. Signed-off-by: Cindy Zhang <[email protected]>

edoakes pushed a commit that referenced this pull request Jan 10, 2023

Revert "Revert "[serve] Add initial health check before marking a rep… (

6a7edce

#31554) #31189 broke the Java codepath. This PR fixes that and also adds the initial health check to Java behavior.

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023

Revert "[serve] Add initial health check before marking a replica as …

fbf9ec1

…RUNNING (#31189)" (#31548) This reverts commit 6e5bb24. Signed-off-by: Cindy Zhang <[email protected]>

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023

Revert "Revert "[serve] Add initial health check before marking a rep… (

268ea02

#31554) #31189 broke the Java codepath. This PR fixes that and also adds the initial health check to Java behavior.

zcin deleted the check-health branch January 13, 2023 19:12
