[Feature] Ray Serve CR and Controller #214

simon-mo · 2022-03-26T04:35:00Z

Search before asking

I searched the issues and found no similar feature request.

Description

We would like to contribute a controller embedded in kuberay to operate a Ray Serve application on top of a kuberay cluster.

apiVersion: serve.ray.io/v1
kind: ServingCluster
metadata:
  name: .
  # status is populated by the operator; users can run `kubectl get serve_deployments name -o json | jq .metadata.status` to read the field.
  status: UPDATING|HEALTHY|UNHEALTHY
spec:
  healthCheckConfig:  # optional
    health_period_s: 5s
    consecutive_failures_threshold: 3

  serveConfig:
    - deploymentClass: .
      numReplicas: 2
      rayActorOptions: .

  rayClusterConfig:
    apiVersion: cluster.ray.io/v1
    kind: RayCluster
    metadata:
      generateName: .
    spec:
      maxWorkers: 2
      podTypes:
        - name: head
          rayResources: .
          podConfig:
            apiVersion: v1
            kind: Pod
            metadata:
              generateName: .
            spec:
              containers:
                - name: ray-node
                  image: my_registry/container:v1

This operator performs health checks, handles the initial deployment and redeployment of the Serve app on the kuberay cluster, and rotates the cluster if the Serve application fails. The CR will expose the health check status of the Serve application.
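To make the health check behavior concrete, here is a minimal Python sketch of how probe results could be folded into the CR status. It is not the operator's implementation: `HealthTracker`, `observe`, and `should_rotate_cluster` are hypothetical names, and the rule that the status turns UNHEALTHY after `consecutive_failures_threshold` failed probes in a row is an assumption based on the field name in the spec above.

```python
from dataclasses import dataclass

# Status values taken from the `status` field of the proposed CR.
UPDATING, HEALTHY, UNHEALTHY = "UPDATING", "HEALTHY", "UNHEALTHY"


@dataclass
class HealthTracker:
    """Hypothetical helper that folds probe results into a CR status."""

    consecutive_failures_threshold: int = 3  # mirrors healthCheckConfig
    failures: int = 0
    status: str = UPDATING  # assumed initial state while the app deploys

    def observe(self, probe_ok: bool) -> str:
        """Record one health probe result and return the new status."""
        if probe_ok:
            # Any successful probe resets the failure streak.
            self.failures = 0
            self.status = HEALTHY
        else:
            self.failures += 1
            if self.failures >= self.consecutive_failures_threshold:
                self.status = UNHEALTHY
        return self.status

    def should_rotate_cluster(self) -> bool:
        """Assumed policy: rotate the cluster once the app is UNHEALTHY."""
        return self.status == UNHEALTHY


tracker = HealthTracker(consecutive_failures_threshold=3)
history = [tracker.observe(ok) for ok in (True, False, False, False)]
print(history, tracker.should_rotate_cluster())
```

With a threshold of 3, one or two failed probes leave the status unchanged, so a transient blip would not trigger a cluster rotation.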

You can find more information in this design doc.

Conceptually, this is similar to SparkJob and FlinkJob in their respective operators. It is a high-level concept built on top of existing CRs.

Compared to the Ray Jobs controller/CR design, the service CR is designed to be long-running and should outlive cluster failures. However, both workloads use Ray's REST API endpoint to perform operations on the Ray cluster.

Use case

Deploy Ray Serve applications reliably on a K8s cluster.
Manage Serve applications in a cloud-native way.
Provide an entry point to highly available applications on Ray.

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

brucez-anyscale · 2022-05-16T22:59:21Z

Hi. We have finalized some of the discussion about the design of the new k8s operator for Serve Deployment and RayCluster management. Here is our design doc. We would like to hear feedback from the committee so that we can align on it.
Also cc @simon-mo.
One thing we want to discuss, for example, is how to add this new operator and what the repo package structure should look like.

DmitriGekhtman · 2023-01-14T01:10:38Z

This has been done.

simon-mo added the enhancement New feature or request label Mar 26, 2022

This was referenced May 18, 2022

KubeRay: Relocate files to enable controller extension with Kubebuilder Waynegates/kuberay#2

Closed

KubeRay: Relocate files to enable controller extension with Kubebuilder #268

Merged

KubeRay: kubebuilder creat RayService Controller and CR #270

Merged

Jeffwan added the operator label May 30, 2022

DmitriGekhtman closed this as completed Jan 14, 2023
