Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] Add initial health check before marking a replica as RUNNING #31189

Merged
merged 13 commits into from
Jan 9, 2023

Conversation

zcin
Copy link
Contributor

@zcin zcin commented Dec 19, 2022

Signed-off-by: Cindy Zhang [email protected]

Why are these changes needed?

A deployment is marked as HEALTHY the moment we have reached the target number of RUNNING replicas. However, this is deceiving when replicas are repeatedly failing health checks, because the replicas will be marked RUNNING after successful initialization, and the deployment status will not change until 30 seconds later when the replica fails 3 consecutive health checks and is restarted (then the loop starts again).

This change will ensure that a replica is not "healthy by default"; instead, the replica must pass one health check immediately after startup. If it does, the replica is marked as RUNNING; if not, it is treated like a startup failure and stopped and restarted on the spot.

Related issue number

"Closes #26204"

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@zcin zcin marked this pull request as ready for review December 20, 2022 03:14
@zcin
Copy link
Contributor Author

zcin commented Dec 20, 2022

Windows tests unrelated.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's possible to totally unify this with the initialization check -- can we just make the health check call first check that the user class is initialized and use the same method for both? Basically in replica.py:

async def check_health_wrapper(self):
    await self._wait_for_constructor()
    ... do health check as we do now ...

@@ -69,8 +69,9 @@ class ReplicaState(Enum):
class ReplicaStartupStatus(Enum):
PENDING_ALLOCATION = 1
PENDING_INITIALIZATION = 2
SUCCEEDED = 3
FAILED = 4
PENDING_INITIAL_HEALTH_CHECK = 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason to separate this from PENDING_INITIALIZATION? if there is no need for the distinction let's err on the side of having fewer states (easier to reason about)

@edoakes edoakes added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 31, 2022
except Exception:
logger.exception(f"Exception in deployment '{self._deployment_name}'")
return ReplicaStartupStatus.FAILED, None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the check_ready method description.

@@ -202,6 +203,27 @@ def check_fails_3_times():
check_fails_3_times()


def test_health_check_failure_makes_deployment_unhealthy(serve_instance):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test:
The deployment is HEALTHY -> UNHEALTHY when there is a replica failing the health check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@zcin zcin added serve Ray Serve Related Issue and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jan 4, 2023
Comment on lines 217 to 237
async def is_ready(
self, user_config: Optional[Any] = None, _after: Optional[Any] = None
):
await self._initialize_replica()

if user_config is not None:
await self.reconfigure(user_config)

# A new replica should not be considered healthy until it passes an
# initial health check. If an initial health check fails, consider
# it an initialization failure.
await self.check_health()
return self.get_metadata()

async def reconfigure(
self, user_config: Optional[Any] = None
) -> Tuple[DeploymentConfig, DeploymentVersion]:
# Unused `_after` argument is for scheduling: passing an ObjectRef
# allows delaying reconfiguration until after this call has returned.
if self.replica is None:
await self._initialize_replica()
if user_config is not None:
await self.replica.reconfigure(user_config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice simplification :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we seem to be mixing the terminology of "ready" and "initialized" -- let's pick one and standardize on it both here and in the controller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I picked "initialized" since it seems "ready" is used more broadly in deployment_state e.g. if the user config is updated.

serve.run(WillBeUnhealthy.bind(toggle))

# Check that deployment is healthy initially
assert check_status("HEALTHY")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use enum instead of string directly

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@zcin
Copy link
Contributor Author

zcin commented Jan 9, 2023

@edoakes @sihanwang41 Addressed comments, PTAL!

@edoakes edoakes merged commit 6e5bb24 into ray-project:master Jan 9, 2023
@scv119
Copy link
Contributor

scv119 commented Jan 9, 2023

@zcin this breaks bk://:java and bk://:mac: :apple: Ray C++ and Java
image

@edoakes
Copy link
Contributor

edoakes commented Jan 9, 2023

That's on me, wasn't careful enough with merging. @zcin could you make a revert PR?

zcin added a commit that referenced this pull request Jan 9, 2023
edoakes pushed a commit that referenced this pull request Jan 9, 2023
zcin added a commit that referenced this pull request Jan 10, 2023
…lica as RUNNING (#31189)" (#31548)"

This reverts commit 15676dd.

Signed-off-by: Cindy Zhang <[email protected]>
edoakes pushed a commit that referenced this pull request Jan 10, 2023
#31554)

#31189 broke the Java codepath. This PR fixes that and also adds the initial health check to Java behavior.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…31189)

A deployment is marked as HEALTHY the moment we have reached the target number of RUNNING replicas. However, this is deceiving when replicas are repeatedly failing health checks, because the replicas will be marked RUNNING after successful initialization, and the deployment status will not change until 30 seconds later when the replica fails 3 consecutive health checks and is restarted (then the loop starts again).

This change will ensure that a replica is not "healthy by default"; instead, the replica must pass one health check immediately after startup. If it does, the replica is marked as RUNNING; if not, it is treated like a startup failure and stopped and restarted on the spot.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
#31554)

#31189 broke the Java codepath. This PR fixes that and also adds the initial health check to Java behavior.
@zcin zcin deleted the check-health branch January 13, 2023 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
serve Ray Serve Related Issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[serve] Replica failing health check doesn't cause REST API to return unhealthy
4 participants