[RayService][Health-Check][8/n] Add readiness / liveness probes #1674

Merged on Nov 29, 2023 (4 commits)

@kevin85421 kevin85421 commented Nov 22, 2023

Why are these changes needed?

KubeRay injects readiness/liveness probes only when GCS is enabled. As a result, if GCS is not enabled and the Raylet or dashboard agent process crashes, the crash goes undetected. We should therefore inject the probes regardless of whether GCS is enabled.
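The intended behavior can be modeled with a small sketch. This is an illustrative Python model, not the operator's actual Go code: the function names, the probe timings, the port 52365, and the choice to keep a user-supplied probe are all assumptions made for the example. Only the `/api/local_raylet_healthz` path is taken from this PR description.

```python
import copy

# Assumed URL for illustration; only the path comes from the PR description.
RAYLET_HEALTH_URL = "http://localhost:52365/api/local_raylet_healthz"


def default_probe():
    """Build an exec probe that passes only if the Raylet health endpoint reports success."""
    return {
        "exec": {
            "command": [
                "bash",
                "-c",
                f"wget -T 2 -q -O- {RAYLET_HEALTH_URL} | grep success",
            ]
        },
        # Hypothetical timings; the operator's real defaults may differ.
        "initialDelaySeconds": 10,
        "periodSeconds": 5,
        "timeoutSeconds": 2,
        "failureThreshold": 3,
    }


def inject_default_probes(container, gcs_enabled):
    """Return a copy of the container spec with both probes present.

    gcs_enabled is accepted only to make the point explicit: after this PR
    it no longer gates injection. Probes already set by the user are kept.
    """
    patched = copy.deepcopy(container)
    for key in ("readinessProbe", "livenessProbe"):
        patched.setdefault(key, default_probe())
    return patched
```

Calling `inject_default_probes(container, gcs_enabled=False)` yields the same two probes as `gcs_enabled=True`, which is the change this PR describes.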

In #1656, we decided to offload the health-check responsibilities, including those for the dashboard agent on the Ray head and for Ray Serve applications, to K8s and Ray. With this PR, every RayCluster custom resource has probes that check the status of the Raylet, which shares fate with the Ray dashboard agent.

The compatibility test KubeRayHealthCheckTestCase will fail for Ray 1.13.0 because that version does not support the /api/local_raylet_healthz health-check endpoint. See here for the evidence. It is acceptable to remove the compatibility test for Ray 1.13.0 because that version lacks support for many KubeRay features, such as GCS FT and the Autoscaler. Additionally, in the past three months I have heard of almost no users running versions older than Ray 2.4.0.
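To make the version dependency concrete, here is a minimal sketch of a client-side check against the endpoint. The helper name `raylet_is_healthy`, the default base URL with port 52365, and the assumption that a healthy response body contains the word "success" are illustrative and not confirmed by this PR. On a Ray version without the endpoint, the request would return a non-2xx status and the check would report unhealthy.

```python
import urllib.error
import urllib.request


def raylet_is_healthy(base_url="http://localhost:52365", timeout=2.0):
    """Return True if GET {base_url}/api/local_raylet_healthz answers 200
    and the body contains 'success' (assumed response format)."""
    url = f"{base_url}/api/local_raylet_healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "success" in body
    except (urllib.error.URLError, OSError):
        # Connection refused, a timeout, or a non-2xx status (for example
        # 404 where the endpoint does not exist) all count as unhealthy.
        return False
```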

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -58,6 +59,24 @@ def test_cluster_info(self):
        """Execute "print(ray.cluster_resources())" in the head Pod."""
        EasyJobRule().assert_rule()

    def test_probe_injection(self):
Comment from the PR author (Member):

  • Run the test with kuberay/operator:v1.0.0. The test should fail.

    RAY_IMAGE=rayproject/ray:nightly OPERATOR_IMAGE=kuberay/operator:v1.0.0 python3 tests/compatibility-test.py BasicRayTestCase 2>&1
    [Screenshot: Screen Shot 2023-11-27 at 11 18 29 PM]
  • Run the test with controller:latest (this PR). The test should pass.

    RAY_IMAGE=rayproject/ray:nightly OPERATOR_IMAGE=controller:latest python3 tests/compatibility-test.py BasicRayTestCase 2>&1
    [Screenshot: Screen Shot 2023-11-27 at 11 19 59 PM]
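The kind of check that distinguishes the two runs above can be sketched as follows. This is not the body of `test_probe_injection` from this PR. It is a hypothetical helper that operates on a Pod manifest as a plain dict (for example, one item from `kubectl get pods -o json`) and assumes the Ray container is the first container in the Pod.

```python
def ray_container_has_probes(pod):
    """Return True if the first container of the Pod manifest defines both
    a readiness probe and a liveness probe.

    Assumption: the Ray container is containers[0]; sidecars are ignored.
    """
    containers = pod.get("spec", {}).get("containers", [])
    if not containers:
        return False
    ray_container = containers[0]
    return all(
        ray_container.get(probe) for probe in ("readinessProbe", "livenessProbe")
    )
```

Under these assumptions, a Pod created by an operator that skips injection would return `False`, while a Pod with both probes set would return `True`.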

@kevin85421 kevin85421 marked this pull request as ready for review November 28, 2023 07:51
2 participants