[Feature] Ray Serve CR and Controller #214

Closed
2 tasks done
simon-mo opened this issue Mar 26, 2022 · 2 comments
Labels
enhancement New feature or request operator

Comments

@simon-mo
Collaborator

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

We would like to contribute a controller, embedded in KubeRay, that operates a Ray Serve application on top of a KubeRay cluster.

apiVersion: serve.ray.io/v1
kind: ServingCluster
metadata:
  name: .
  # status is populated by the operator; users can run `kubectl get serve_deployments name -o json | jq .metadata.status` to read the field.
  status: UPDATING|HEALTHY|UNHEALTHY
spec:
  healthCheckConfig:  # optional
    health_period_s: 5s
    consecutive_failures_threshold: 3

  serveConfig:
    - deploymentClass: .
      numReplicas: 2
      rayActorOptions: .

  rayClusterConfig:
    apiVersion: cluster.ray.io/v1
    kind: RayCluster
    metadata:
      generateName: .
    spec:
      maxWorkers: 2
      podTypes:
        - name: head
          rayResources: .
          podConfig:
            apiVersion: v1
            kind: Pod
            metadata:
              generateName: .
            spec:
              containers:
                - name: ray-node
                  image: my_registry/container:v1
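
To make the optional `healthCheckConfig` block concrete, here is a minimal Python sketch of how a controller might read it once the CR has been loaded into a dict. The class name, the helper names, and the defaults (5 seconds, 3 failures, taken from the example values above) are assumptions for illustration, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckConfig:
    # Defaults mirror the example values in the spec above (an assumption).
    period_s: float = 5.0
    consecutive_failures_threshold: int = 3


def parse_duration_s(value) -> float:
    """Accept a plain number (5, 5.0) or a seconds string such as "5s"."""
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip()
    if text.endswith("s"):
        text = text[:-1]
    return float(text)


def parse_health_check_config(spec: dict) -> HealthCheckConfig:
    """Build a HealthCheckConfig from the `spec` section of the CR."""
    raw = spec.get("healthCheckConfig") or {}  # the block is optional
    cfg = HealthCheckConfig(
        period_s=parse_duration_s(raw.get("health_period_s", 5)),
        consecutive_failures_threshold=int(
            raw.get("consecutive_failures_threshold", 3)
        ),
    )
    if cfg.period_s <= 0 or cfg.consecutive_failures_threshold < 1:
        raise ValueError("healthCheckConfig values must be positive")
    return cfg
```

Omitting the block yields the defaults, so `parse_health_check_config({})` and the example spec above produce the same configuration.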

This operator performs health checks, handles the initial deployment and redeployment of the Serve app on the KubeRay cluster, and rotates the cluster if the Serve application fails. The CR will expose the health-check status of the Serve application.
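
The health-check behavior described above could be sketched roughly as follows. This is a simplified Python model, not the operator's implementation: `HealthTracker`, `observe`, and `on_redeploy` are hypothetical names, and the rule that the cluster is rotated once `consecutive_failures_threshold` probes fail in a row is an assumption drawn from the field name in the spec.

```python
from enum import Enum


class Status(Enum):
    UPDATING = "UPDATING"
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"


class HealthTracker:
    """Tracks probe results and decides when the cluster should be rotated."""

    def __init__(self, consecutive_failures_threshold: int = 3):
        self.threshold = consecutive_failures_threshold
        self.failures = 0
        self.status = Status.UPDATING  # a fresh deployment starts as UPDATING

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True if the cluster should rotate."""
        if probe_ok:
            self.failures = 0
            self.status = Status.HEALTHY
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.status = Status.UNHEALTHY
            return True
        # Below the threshold the previous status is kept.
        return False

    def on_redeploy(self) -> None:
        """Reset state after the Serve app is redeployed or the cluster rotated."""
        self.failures = 0
        self.status = Status.UPDATING
```

A controller loop would call `observe` once per `health_period_s` and, when it returns True, rotate the cluster and call `on_redeploy`. A single successful probe resets the failure count, so only uninterrupted runs of failures trigger a rotation.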

You can find more information in this design doc.

Conceptually, this is similar to SparkJob and FlinkJob in their respective operators. It is a high-level concept built on top of existing CRs.

Compared with the Ray Jobs controller/CR design, the service CR is designed to be long-running and should outlive cluster failures. However, both workloads use Ray's REST API endpoint to perform operations on the Ray cluster.

Use case

  • Deploy Ray Serve applications reliably on a K8s cluster.
  • Manage Serve applications in a cloud-native way.
  • Provide an entry point to highly available applications on Ray.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

simon-mo added the enhancement label on Mar 26, 2022

@brucez-anyscale
Contributor

Hi. We have finalized some discussions about the design of the new K8s operator for Serve Deployment and RayCluster management. Here is our design doc. We would like to hear feedback from the committee so that we can align on the approach.
cc @simon-mo.
One thing we would like to discuss, for example, is how to add this new operator and what the repo package structure should look like.

@DmitriGekhtman
Collaborator

This has been done.

4 participants