[Doc] GKE GPU cluster setup (#1223)
kevin85421 authored Jul 7, 2023
1 parent 19054cb commit 1ee5f95
Showing 2 changed files with 75 additions and 2 deletions.
74 changes: 74 additions & 0 deletions docs/guidance/gcp-gke-gpu-cluster.md
@@ -0,0 +1,74 @@
# Start Google Cloud GKE Cluster with GPUs for KubeRay

## Step 1: Create a Kubernetes cluster on GKE

Run this command and all subsequent commands on your local machine or in [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) first. The following command creates a Kubernetes cluster named `kuberay-gpu-cluster` with 1 CPU node in the `us-west1-b` zone. In this example, we use the `e2-standard-4` machine type, which has 4 vCPUs and 16 GB RAM.

```sh
gcloud container clusters create kuberay-gpu-cluster \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-west1-b --machine-type e2-standard-4
```

> Note: You can also create a cluster from the [Google Cloud Console](https://console.cloud.google.com/kubernetes/list).
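Once the command finishes, you can confirm that the cluster exists (an optional sanity check, not part of the original guide):

```sh
# List clusters in the zone; kuberay-gpu-cluster should show STATUS RUNNING.
gcloud container clusters list --zone us-west1-b
```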

## Step 2: Create a GPU node pool

Run the following command to create a GPU node pool for Ray GPU workers.
(You can also create it from the Google Cloud Console; see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints#create_a_node_pool_with_node_taints) for more details.)

```sh
gcloud container node-pools create gpu-node-pool \
    --accelerator type=nvidia-l4-vws,count=1 \
    --zone us-west1-b \
    --cluster kuberay-gpu-cluster \
    --num-nodes 1 \
    --min-nodes 0 \
    --max-nodes 1 \
    --enable-autoscaling \
    --machine-type g2-standard-4 \
    --node-taints=ray.io/node-type=worker:NoSchedule
```

The `--accelerator` flag specifies the type and number of GPUs for each node in the node pool. In this example, we use the [NVIDIA L4](https://cloud.google.com/compute/docs/gpus#l4-gpus) GPU. The machine type `g2-standard-4` has 1 GPU, 24 GB of GPU memory, 4 vCPUs, and 16 GB RAM.
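If you want a different GPU, you can first check which accelerator types the zone offers (an optional lookup, not part of the original guide):

```sh
# List the GPU accelerator types available in us-west1-b.
gcloud compute accelerator-types list --filter="zone:us-west1-b"
```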

The taint `ray.io/node-type=worker:NoSchedule` prevents CPU-only Pods, such as the KubeRay operator, Ray head, and CoreDNS Pods, from being scheduled on this GPU node pool. Because GPUs are expensive, we want to reserve this node pool for Ray GPU workers only.

Concretely, any Pod that does not have the following toleration will not be scheduled on this GPU node pool:

```yaml
tolerations:
- key: ray.io/node-type
  operator: Equal
  value: worker
  effect: NoSchedule
```
For more on taints and tolerations, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
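As a concrete illustration, a GPU worker group in a RayCluster manifest would carry this toleration so its Pods can land on the GPU node pool. This is a minimal hypothetical fragment; the group name, image tag, and replica count are placeholders, not values from this guide:

```yaml
workerGroupSpecs:
- groupName: gpu-group            # hypothetical group name
  replicas: 1
  template:
    spec:
      tolerations:
      - key: ray.io/node-type     # matches the node pool's taint
        operator: Equal
        value: worker
        effect: NoSchedule
      containers:
      - name: ray-worker
        image: rayproject/ray-ml:2.5.0-gpu   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1     # claim the node's single L4 GPU
```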

## Step 3: Configure `kubectl` to connect to the cluster

Run the following command to download Google Cloud credentials and configure the Kubernetes CLI to use them.

```sh
gcloud container clusters get-credentials kuberay-gpu-cluster --zone us-west1-b
```

For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).
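To double-check that `kubectl` now targets the new cluster, you can run (an optional check):

```sh
# The current context should reference kuberay-gpu-cluster.
kubectl config current-context
# Both the CPU node and the GPU node should be listed and Ready.
kubectl get nodes
```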

## Step 4: Install NVIDIA GPU device drivers

This step is required for GPU support on GKE. See the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) for more details.

```sh
# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
# Verify that your nodes have allocatable GPUs
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
# Example output:
# NAME                                          GPU
# gke-kuberay-gpu-cluster-gpu-node-pool-xxxxx   1
# gke-kuberay-gpu-cluster-default-pool-xxxxx    <none>
```
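If the GPU column shows `<none>` for the GPU node at first, the driver installer may still be running. Assuming the DaemonSet keeps its upstream name and labels (`nvidia-driver-installer` in `kube-system`), you can watch its progress:

```sh
# Wait for the driver-installer Pod on the GPU node to finish initializing.
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer -w
```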
3 changes: 1 addition & 2 deletions docs/guidance/stable-diffusion-rayservice.md
@@ -5,8 +5,7 @@
and [the Ray documentation](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html)

## Step 1: Create a Kubernetes cluster with GPUs

-Follow [aws-eks-gpu-cluster.md](./aws-eks-gpu-cluster.md) to create an AWS EKS cluster with 1
-CPU (`m5.xlarge`) node and 1 GPU (`g5.xlarge`) node.
+Follow [aws-eks-gpu-cluster.md](./aws-eks-gpu-cluster.md) or [gcp-gke-gpu-cluster.md](./gcp-gke-gpu-cluster.md) to create a Kubernetes cluster with 1 CPU node and 1 GPU node.

## Step 2: Install the nightly KubeRay operator

