[Doc][Website] Update KubeRay introduction and fix layout issues #1042

Merged
merged 4 commits into from
Apr 24, 2023
37 changes: 27 additions & 10 deletions README.md
@@ -3,16 +3,33 @@
[![Build Status](https://github.com/ray-project/kuberay/workflows/Go-build-and-test/badge.svg)](https://github.com/ray-project/kuberay/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/ray-project/kuberay)](https://goreportcard.com/report/github.com/ray-project/kuberay)

KubeRay is an open source toolkit to run Ray applications on Kubernetes.
It provides several tools to simplify managing Ray clusters on Kubernetes.

- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
- Native Job and Serving integration with Clusters
- Data Scientist centric workspace for fast prototyping (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)
KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of [Ray](https://github.com/ray-project/ray) applications on Kubernetes. It offers several key components:

**KubeRay core**: This is the official, fully maintained component of KubeRay that provides three custom resource definitions: RayCluster, RayJob, and RayService. These resources are designed to help you run a wide range of workloads with ease.

* **RayCluster**: KubeRay fully manages the lifecycle of RayCluster, including cluster creation/deletion, autoscaling, and ensuring fault tolerance.

* **RayJob**: With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the job finishes.

* **RayService**: RayService is made up of two parts: a RayCluster and a Ray Serve deployment graph. RayService offers zero-downtime upgrades for RayCluster and high availability.
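For a concrete picture of what the core CRDs look like, here is a minimal RayCluster sketch. It is illustrative only: the name, image tag, and replica counts are assumptions rather than values taken from this repository, so treat it as a starting point and consult the KubeRay sample manifests for authoritative examples.

```yaml
# Minimal RayCluster sketch (illustrative values only)
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-mini
spec:
  rayVersion: "2.2.0"                # should match the Ray version in the container image
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.2.0-py38-cpu
  workerGroupSpecs:
    - groupName: workergroup
      replicas: 1
      minReplicas: 1
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.2.0-py38-cpu
```

Applying a manifest like this with `kubectl apply -f` should lead the operator to create one head Pod and one worker Pod; RayJob and RayService embed a cluster spec of the same shape inside their own resources.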

**Community-managed components (optional)**: Some components are maintained by the KubeRay community.

* **KubeRay APIServer**: It provides a layer of simplified configuration for KubeRay resources. The KubeRay API server is used internally
by some organizations to back user interfaces for KubeRay resource management.

* **KubeRay Python client**: This Python client library provides APIs to manage RayCluster resources from your Python application.

* **KubeRay CLI**: KubeRay CLI provides the ability to manage KubeRay resources through a command-line interface.

## KubeRay ecosystem

* [AWS Application Load Balancer](docs/guidance/ingress.md)
* [Nginx](docs/guidance/ingress.md)
* [Prometheus and Grafana](docs/guidance/prometheus-grafana.md)
* [Volcano](docs/guidance/volcano-integration.md)
* [MCAD](docs/guidance/kuberay-with-MCAD.md)
* [Kubeflow](docs/guidance/kubeflow-integration.md)

## Documentation

1 change: 1 addition & 0 deletions docs/guidance/ingress.md
@@ -1,6 +1,7 @@
## Ingress Usage

Here we provide some examples to show how to use ingress to access your Ray cluster.

* [Example: AWS Application Load Balancer (ALB) Ingress support on AWS EKS](#example-aws-application-load-balancer-alb-ingress-support-on-aws-eks)
* [Example: Manually setting up NGINX Ingress on KinD](#example-manually-setting-up-nginx-ingress-on-kind)

1 change: 1 addition & 0 deletions docs/guidance/kubeflow-integration.md
@@ -45,6 +45,7 @@ kubectl get pod -l ray.io/cluster=raycluster-kuberay
# raycluster-kuberay-head-bz77b 1/1 Running 0 64s
# raycluster-kuberay-worker-workergroup-8gr5q 1/1 Running 0 63s
```

* This step uses `rayproject/ray:2.2.0-py38-cpu` as its image. Ray is very sensitive to mismatched Python and Ray versions between the server (RayCluster) and client (JupyterLab) sides. This image uses:
* Python 3.8.13
* Ray 2.2.0
1 change: 1 addition & 0 deletions docs/guidance/observability.md
@@ -6,6 +6,7 @@
In the RayCluster resource definition, we use `State` to represent the current status of the Ray cluster.

For now, the RayCluster's `status.state` field exposes three possible values: `ready`, `unhealthy`, and `failed`.

| State | Description |
| --------- | ----------------------------------------------------------------------------------------------- |
| ready | The Ray cluster is ready for use. |
12 changes: 7 additions & 5 deletions docs/guidance/pod-command.md
@@ -30,12 +30,13 @@ Currently, for timing (1), we can set the container's `Command` and `Args` in Ra
command: ["echo 123"]
args: ["456"]
```

* Ray head Pod
* `spec.containers.0.command` is hardcoded with `["/bin/bash", "-lc", "--"]`.
* `spec.containers.0.args` contains two parts:
* (Part 1) **user-specified command**: A string that concatenates `headGroupSpec.template.spec.containers.0.command` and `headGroupSpec.template.spec.containers.0.args` from RayCluster.
* (Part 2) **ray start command**: The command is created based on `rayStartParams` specified in RayCluster. The command will look like `ulimit -n 65536; ray start ...`.
* To summarize, `spec.containers.0.args` will be `$(user-specified command) && $(ray start command)`, as sketched below.
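To make the concatenation rule concrete, the resulting head container would look roughly like the following. This is an illustrative reconstruction based on the `echo 123` / `456` example and the `ulimit -n 65536; ray start ...` command described above, not output copied from a real Pod.

```yaml
# Illustrative head container after KubeRay injects the ray start command
# (the actual ray start flags depend on rayStartParams)
command: ["/bin/bash", "-lc", "--"]
args: ["echo 123 456 && ulimit -n 65536; ray start ..."]
```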

* Example
```sh
@@ -128,6 +129,7 @@ lifecycle:
exec:
command: ["/bin/sh","-c","/home/ray/samples/ray_cluster_resources.sh"]
```

* We execute the script `ray_cluster_resources.sh` via the postStart hook. Based on [this document](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks), there is no guarantee that the hook will execute before the container ENTRYPOINT. Hence, we need to wait for RayCluster to finish initialization in `ray_cluster_resources.sh`.

* Example
4 changes: 3 additions & 1 deletion docs/guidance/prometheus-grafana.md
@@ -25,6 +25,7 @@ kubectl get all -n prometheus-system
# deployment.apps/prometheus-kube-prometheus-operator 1/1 1 1 46s
# deployment.apps/prometheus-kube-state-metrics 1/1 1 1 46s
```

* KubeRay provides an [install.sh script](../../install/prometheus/install.sh) to install the [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) chart and related custom resources, including **ServiceMonitor**, **PodMonitor** and **PrometheusRule**, in the namespace `prometheus-system` automatically.

## Step 3: Install a KubeRay operator
@@ -92,9 +93,9 @@ spec:
targetLabels:
- ray.io/cluster
```

* The YAML example above is [serviceMonitor.yaml](../../config/prometheus/serviceMonitor.yaml), and it is created by **install.sh**. Hence, no need to create anything here.
* See [ServiceMonitor official document](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#servicemonitor) for more details about the configurations.

* `release: $HELM_RELEASE`: Prometheus can only detect ServiceMonitor with this label.
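As a sketch of where that label lives, the ServiceMonitor's metadata would carry it roughly as below. The resource name and release value here are assumptions; the release label must match the Helm release name used when installing kube-prometheus-stack.

```yaml
# Illustrative: Prometheus only picks up ServiceMonitors whose labels match its
# serviceMonitorSelector, which keys off the Helm release label (see the note above)
metadata:
  name: ray-head-monitor        # hypothetical name
  namespace: prometheus-system
  labels:
    release: prometheus         # assumed Helm release name
```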

<div id="prometheus-can-only-detect-this-label" ></div>
@@ -156,6 +157,7 @@ spec:
podMetricsEndpoints:
- port: metrics
```

* `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label. See [here](#prometheus-can-only-detect-this-label) for more details.

* The `namespaceSelector` and `selector` fields in **PodMonitor** are used to select Kubernetes Pods.
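For illustration, those two fields might be filled in roughly as follows; the namespace and label key/value are assumptions (KubeRay attaches `ray.io/*` labels to the Pods it creates), so adjust them to the Pods you actually want to scrape.

```yaml
# Illustrative PodMonitor selection block (namespace and labels are assumed)
namespaceSelector:
  matchNames:
    - default
selector:
  matchLabels:
    ray.io/node-type: worker   # scrape Ray worker Pods
```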
2 changes: 2 additions & 0 deletions docs/guidance/tls.md
@@ -42,6 +42,7 @@ kubectl apply -f ray-operator/config/samples/ray-cluster.tls.yaml
```

`ray-cluster.tls.yaml` will create:

* A Kubernetes Secret containing the CA's private key (`ca.key`) and self-signed certificate (`ca.crt`) (**Step 1**)
* A Kubernetes ConfigMap containing the scripts `gencert_head.sh` and `gencert_worker.sh`, which allow Ray Pods to generate private keys
(`tls.key`) and self-signed certificates (`tls.crt`) (**Step 2**)
@@ -75,6 +76,7 @@ openssl x509 -in ca.crt -noout -text
# (Note: You should comment out the Kubernetes Secret in `ray-cluster.tls.yaml`.)
kubectl create secret generic ca-tls --from-file=ca.key --from-file=ca.crt
```

* `ca.key`: CA's private key
* `ca.crt`: CA's self-signed certificate
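For reference, the Secret created by the `kubectl create secret generic` command above has roughly the shape sketched below; the placeholders stand in for the base64-encoded file contents, which `kubectl` encodes for you.

```yaml
# Illustrative shape of the resulting ca-tls Secret
apiVersion: v1
kind: Secret
metadata:
  name: ca-tls
type: Opaque
data:
  ca.key: <base64-encoded contents of ca.key>   # CA private key
  ca.crt: <base64-encoded contents of ca.crt>   # CA self-signed certificate
```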

34 changes: 25 additions & 9 deletions docs/index.md
@@ -24,17 +24,33 @@

## KubeRay

KubeRay is an open source toolkit to run Ray applications on Kubernetes.
KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of [Ray](https://github.com/ray-project/ray) applications on Kubernetes. It offers several key components:

KubeRay provides several tools to simplify managing Ray clusters on Kubernetes.
**KubeRay core**: This is the official, fully maintained component of KubeRay that provides three custom resource definitions: RayCluster, RayJob, and RayService. These resources are designed to help you run a wide range of workloads with ease.

* **RayCluster**: KubeRay fully manages the lifecycle of RayCluster, including cluster creation/deletion, autoscaling, and ensuring fault tolerance.

- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
- Native Job and Serving integration with Clusters
- Data Scientist centric workspace for fast prototyping (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)
* **RayJob**: With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the job finishes.

* **RayService**: RayService is made up of two parts: a RayCluster and a Ray Serve deployment graph. RayService offers zero-downtime upgrades for RayCluster and high availability.

**Community-managed components (optional)**: Some components are maintained by the KubeRay community.

* **KubeRay APIServer**: It provides a layer of simplified configuration for KubeRay resources. The KubeRay API server is used internally
by some organizations to back user interfaces for KubeRay resource management.

* **KubeRay Python client**: This Python client library provides APIs to manage RayCluster resources from your Python application.

* **KubeRay CLI**: KubeRay CLI provides the ability to manage KubeRay resources through a command-line interface.

## KubeRay ecosystem

* [AWS Application Load Balancer](guidance/ingress/#example-aws-application-load-balancer-alb-ingress-support-on-aws-eks)
* [Nginx](guidance/ingress/#example-manually-setting-up-nginx-ingress-on-kind)
* [Prometheus and Grafana](guidance/prometheus-grafana/)
* [Volcano](guidance/volcano-integration/)
* [MCAD](guidance/kuberay-with-MCAD/)
* [Kubeflow](guidance/kubeflow-integration/)

## Security
