
[Feature] Add liveness and readiness probes for Ray worker pods. #308

Closed · 1 of 2 tasks
DmitriGekhtman opened this issue Jun 15, 2022 · 5 comments
Labels: enhancement (New feature or request)

Comments

DmitriGekhtman (Collaborator) commented Jun 15, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

The KubeRay operator should inject liveness and readiness probes into Ray pods.
Ray already ships a `ray health-check` command that should work for this purpose.
Readiness information should be surfaced in the RayCluster CR's status field.

Once this is implemented, the Ray autoscaler's resource-heartbeat health check should be toggled off.
(This requires adding a flag in the Ray autoscaler code to turn that function off.)
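For concreteness, a minimal sketch of what the operator-injected probes might look like on a Ray worker container. The use of `ray health-check` as the exec command follows the suggestion above; the timing values are assumptions, not a final design:

```yaml
# Hypothetical probe configuration the KubeRay operator could inject
# into each Ray worker container. The timings below are illustrative
# assumptions only.
livenessProbe:
  exec:
    command: ["ray", "health-check"]
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 12      # tolerate ~60s of failures before restart
readinessProbe:
  exec:
    command: ["ray", "health-check"]
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3       # mark unready quickly; no restart
```

The kubelet would run these probes; the operator's job would only be to generate the spec and to reflect readiness back into `RayCluster.Status`.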

Use case

Better management of Ray worker pods!

Related issues

Exposing status info in the CRD
#223

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
DmitriGekhtman added the enhancement (New feature or request) label on Jun 15, 2022
DmitriGekhtman (Collaborator, Author) commented:

@wuisawesome @iycheng @mwtian come to think of it, `ray health-check` is not a good health check for a Ray worker -- it is actually a remote check of the GCS.

Is there a good way to health check a Ray worker node?

DmitriGekhtman (Collaborator, Author) commented:

cc @akanso @Jeffwan @wilsonwang371
cc @daikeshi @davidxia re: the connection to RayCluster.Status

wuisawesome commented:

Oh whoops, I missed the sync, but we should implement it as a flag for `ray health-check`.

There's a question of whether we should attempt to ping the raylet directly, or ask the GCS / read the status from the NodeTable.

Who would be responsible for probing? The k8s deployment controller, or the KubeRay operator?

DmitriGekhtman (Collaborator, Author) commented:

> Who would be responsible for probing? The k8s deployment controller, or the KubeRay operator?

The kubelet of the node on which the Ray pod is scheduled.

> There's a question of whether we should attempt to ping the raylet directly, or ask the GCS / read the status from the NodeTable.

Maybe both? Especially if there's a possibility that these could fail independently.
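If both checks are wanted, one way to combine them is a single exec probe that fails when either the local raylet or the GCS is unhealthy. This is only a sketch: the dashboard-agent port (52365) and the `/api/local_raylet_healthz` endpoint are assumptions about where raylet health is exposed, and `ray health-check` stands in for the GCS-side check discussed above:

```yaml
# Hypothetical combined probe: check the local raylet via the dashboard
# agent, then the GCS via `ray health-check`. The port and endpoint path
# are assumptions, not confirmed interfaces.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - >
        wget -q -T 2 -O- http://localhost:52365/api/local_raylet_healthz
        && ray health-check
  periodSeconds: 5
  failureThreshold: 12
```

Splitting the two checks across the liveness and readiness probes would be an alternative if they can fail independently and warrant different remediation.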

DmitriGekhtman (Collaborator, Author) commented Jul 13, 2022

This is being covered in the context of Ray HA work. Closing to deduplicate.
