-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] Add initial health check before marking a replica as RUNNING #31189
Conversation
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Windows tests unrelated. |
Signed-off-by: Cindy Zhang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's possible to totally unify this with the initialization check -- can we just make the health check call first check that the user class is initialized and use the same method for both? Basically in replica.py
:
async def check_health_wrapper(self):
await self._wait_for_constructor()
... do health check as we do now ...
@@ -69,8 +69,9 @@ class ReplicaState(Enum): | |||
class ReplicaStartupStatus(Enum): | |||
PENDING_ALLOCATION = 1 | |||
PENDING_INITIALIZATION = 2 | |||
SUCCEEDED = 3 | |||
FAILED = 4 | |||
PENDING_INITIAL_HEALTH_CHECK = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason to separate this from PENDING_INITIALIZATION
? if there is no need for the distinction let's err on the side of having fewer states (easier to reason about)
except Exception: | ||
logger.exception(f"Exception in deployment '{self._deployment_name}'") | ||
return ReplicaStartupStatus.FAILED, None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the check_ready
method description.
@@ -202,6 +203,27 @@ def check_fails_3_times(): | |||
check_fails_3_times() | |||
|
|||
|
|||
def test_health_check_failure_makes_deployment_unhealthy(serve_instance): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a test:
The deployment is HEALTHY
-> UNHEALTHY
when there is a replica failing the health check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
python/ray/serve/_private/replica.py
Outdated
async def is_ready( | ||
self, user_config: Optional[Any] = None, _after: Optional[Any] = None | ||
): | ||
await self._initialize_replica() | ||
|
||
if user_config is not None: | ||
await self.reconfigure(user_config) | ||
|
||
# A new replica should not be considered healthy until it passes an | ||
# initial health check. If an initial health check fails, consider | ||
# it an initialization failure. | ||
await self.check_health() | ||
return self.get_metadata() | ||
|
||
async def reconfigure( | ||
self, user_config: Optional[Any] = None | ||
) -> Tuple[DeploymentConfig, DeploymentVersion]: | ||
# Unused `_after` argument is for scheduling: passing an ObjectRef | ||
# allows delaying reconfiguration until after this call has returned. | ||
if self.replica is None: | ||
await self._initialize_replica() | ||
if user_config is not None: | ||
await self.replica.reconfigure(user_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a nice simplification :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we seem to be mixing the terminology of "ready" and "initialized" -- let's pick one and standardize on it both here and in the controller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I picked "initialized" since it seems "ready" is used more broadly in deployment_state
e.g. if the user config is updated.
serve.run(WillBeUnhealthy.bind(toggle)) | ||
|
||
# Check that deployment is healthy initially | ||
assert check_status("HEALTHY") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use enum instead of string directly
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@edoakes @sihanwang41 Addressed comments, PTAL! |
@zcin this breaks bk://:java and bk://:mac: :apple: Ray C++ and Java |
That's on me, wasn't careful enough with merging. @zcin could you make a revert PR? |
…RUNNING (#31189)" This reverts commit 6e5bb24. Signed-off-by: Cindy Zhang <[email protected]>
…RUNNING (#31189)" (#31548) This reverts commit 6e5bb24. Signed-off-by: Cindy Zhang <[email protected]>
…lica as RUNNING (#31189)" (#31548)" This reverts commit 15676dd. Signed-off-by: Cindy Zhang <[email protected]>
…31189) A deployment is marked as HEALTHY the moment we have reached the target number of RUNNING replicas. However, this is deceiving when replicas are repeatedly failing health checks, because the replicas will be marked RUNNING after successful initialization, and the deployment status will not change until 30 seconds later when the replica fails 3 consecutive health checks and is restarted (then the loop starts again). This change will ensure that a replica is not "healthy by default"; instead, the replica must pass one health check immediately after startup. If it does, the replica is marked as RUNNING; if not, it is treated like a startup failure and stopped and restarted on the spot.
…RUNNING (#31189)" (#31548) This reverts commit 6e5bb24. Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang [email protected]
Why are these changes needed?
A deployment is marked as HEALTHY the moment we have reached the target number of RUNNING replicas. However, this is deceiving when replicas are repeatedly failing health checks, because the replicas will be marked RUNNING after successful initialization, and the deployment status will not change until 30 seconds later when the replica fails 3 consecutive health checks and is restarted (then the loop starts again).
This change will ensure that a replica is not "healthy by default"; instead, the replica must pass one health check immediately after startup. If it does, the replica is marked as RUNNING; if not, it is treated like a startup failure and stopped and restarted on the spot.
Related issue number
"Closes #26204"
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.