Search before asking
I had searched in the issues and found no similar feature requirement.
Description
We would like to contribute a controller, embedded in KubeRay, that operates a Ray Serve application on top of a KubeRay cluster.
```yaml
apiVersion: serve.ray.io/v1
kind: ServingCluster
metadata:
  name: .
# status is populated by the operator; users can run
# `kubectl get serve_deployments <name> -o json | jq .status` to read the field.
status: UPDATING|HEALTHY|UNHEALTHY
spec:
  healthCheckConfig: # optional
    health_period_s: 5s
    consecutive_failures_threshold: 3
  serveConfig:
    - deploymentClass: .
      numReplicas: 2
      rayActorOptions: .
  rayClusterConfig:
    apiVersion: cluster.ray.io/v1
    kind: RayCluster
    metadata:
      generatedName: .
    spec:
      maxWorkers: 2
      podTypes:
        - name: head
          rayResources: .
          podConfig:
            apiVersion: v1
            kind: Pod
            metadata:
              generatedName: .
            spec:
              containers:
                - name: ray-node
                  image: my_registry/container:v1
```
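To make the shape of the CR concrete, here is a rough sketch of how the spec above could map to Go API types following kubebuilder conventions. This is not final API code; every type and field name below is an illustrative assumption.

```go
// Illustrative Go types for the ServingCluster CR above (names are assumptions).
package v1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HealthCheckConfig controls how the operator probes the Serve application.
type HealthCheckConfig struct {
	HealthPeriodSeconds          int32 `json:"healthPeriodS,omitempty"`
	ConsecutiveFailuresThreshold int32 `json:"consecutiveFailuresThreshold,omitempty"`
}

// ServeDeploymentSpec describes one Serve deployment to run on the cluster.
type ServeDeploymentSpec struct {
	DeploymentClass string            `json:"deploymentClass"`
	NumReplicas     int32             `json:"numReplicas"`
	RayActorOptions map[string]string `json:"rayActorOptions,omitempty"`
}

// WorkerPodType mirrors one entry of rayClusterConfig.spec.podTypes.
type WorkerPodType struct {
	Name         string                 `json:"name"`
	RayResources map[string]int64       `json:"rayResources,omitempty"`
	PodConfig    corev1.PodTemplateSpec `json:"podConfig"`
}

// RayClusterConfig mirrors the embedded RayCluster definition.
type RayClusterConfig struct {
	MaxWorkers int32           `json:"maxWorkers,omitempty"`
	PodTypes   []WorkerPodType `json:"podTypes"`
}

// ServingClusterSpec mirrors the spec section of the example YAML.
type ServingClusterSpec struct {
	HealthCheckConfig *HealthCheckConfig    `json:"healthCheckConfig,omitempty"`
	ServeConfig       []ServeDeploymentSpec `json:"serveConfig"`
	RayClusterConfig  RayClusterConfig      `json:"rayClusterConfig"`
}

// ServingClusterStatus is written by the operator: UPDATING, HEALTHY, or UNHEALTHY.
type ServingClusterStatus struct {
	ApplicationStatus string `json:"applicationStatus,omitempty"`
}

// ServingCluster is the top-level custom resource.
type ServingCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ServingClusterSpec   `json:"spec,omitempty"`
	Status ServingClusterStatus `json:"status,omitempty"`
}
```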
This operator performs health checks, handles the initial deployment and redeployment of the Serve application on the KubeRay cluster, and rotates the cluster if the Serve application fails. The CR exposes the health-check status of the Serve application.
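As a sketch of that supervision behavior (not KubeRay's actual implementation), the loop could look roughly like the following, where `checkServeHealth` and `rotateCluster` are hypothetical hooks standing in for the Ray REST call and the RayCluster recreation:

```go
// Sketch of the operator's supervision loop: probe Serve health on a fixed
// period, count consecutive failures, and rotate the Ray cluster once the
// configured threshold is exceeded.
package main

import (
	"context"
	"log"
	"time"
)

func superviseServeApp(ctx context.Context, period time.Duration, threshold int,
	checkServeHealth func(context.Context) error, rotateCluster func(context.Context) error) {

	failures := 0
	ticker := time.NewTicker(period)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := checkServeHealth(ctx); err != nil {
				failures++
				log.Printf("serve health check failed (%d consecutive): %v", failures, err)
				if failures >= threshold {
					// Threshold exceeded: rotate the underlying Ray cluster and
					// redeploy the Serve application on the new cluster.
					if err := rotateCluster(ctx); err != nil {
						log.Printf("cluster rotation failed: %v", err)
						continue
					}
					failures = 0
				}
				continue
			}
			// A healthy probe resets the consecutive-failure counter.
			failures = 0
		}
	}
}

func main() {
	// Bounded context so the example terminates; the real operator runs until stopped.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	// Stubs only; the operator would wire in real REST and Kubernetes calls here.
	healthy := func(context.Context) error { return nil }
	rotate := func(context.Context) error { log.Println("rotating Ray cluster"); return nil }
	superviseServeApp(ctx, 5*time.Second, 3, healthy, rotate)
}
```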
You can find more information in this design doc.
Conceptually this is similar to SparkJob and FlinkJob in their respective operators. It is a high-level concept built on top of existing CRs.
Compared to the Ray Job controller/CR design, the service CR is designed to be long-running and should outlive cluster failures. However, both workloads use Ray's REST API endpoint to perform operations on the Ray cluster.
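For illustration, the probe both controllers need boils down to an HTTP call against the Ray head node's dashboard port. The service address and route below are placeholders (the exact Serve REST API path depends on the Ray version); only the pattern is the point.

```go
// Sketch of a Serve status probe against the Ray dashboard; URL and path are
// placeholder assumptions, not a documented KubeRay or Ray API.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// getServeStatus GETs the Serve status from the Ray dashboard and returns the
// raw response body, which the controller would parse into the CR status.
func getServeStatus(ctx context.Context, dashboardURL string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}

	// Placeholder route: consult the Ray Serve REST API docs for the actual path.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		dashboardURL+"/api/serve/deployments/status", nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status from Ray dashboard: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// The head service address is illustrative (8265 is the default dashboard port).
	status, err := getServeStatus(context.Background(), "http://raycluster-head-svc:8265")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println(status)
}
```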
Use case
- Deploy Ray Serve applications reliably on a K8s cluster.
- Manage Serve applications in a cloud-native way.
- Provide an entry point to highly available applications on Ray.
Related issues
No response
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Hi. We have finalized some discussion about the design of the new K8s operator for Serve deployment and RayCluster management. Here is our design doc. We would like to hear feedback from the committee so we can align on the approach.
Also cc @simon-mo.
One example topic we want to discuss is how to add this new operator and what the repo package structure should look like.