From 1ee5f95c5322c6714c4154d06a52cf35157a10a3 Mon Sep 17 00:00:00 2001
From: Kai-Hsun Chen
Date: Fri, 7 Jul 2023 15:13:00 -0700
Subject: [PATCH] [Doc] GKE GPU cluster setup (#1223)

GKE GPU cluster setup
---
 docs/guidance/gcp-gke-gpu-cluster.md         | 74 ++++++++++++++++++++
 docs/guidance/stable-diffusion-rayservice.md |  3 +-
 2 files changed, 75 insertions(+), 2 deletions(-)
 create mode 100644 docs/guidance/gcp-gke-gpu-cluster.md

diff --git a/docs/guidance/gcp-gke-gpu-cluster.md b/docs/guidance/gcp-gke-gpu-cluster.md
new file mode 100644
index 0000000000..be0e9b1f1d
--- /dev/null
+++ b/docs/guidance/gcp-gke-gpu-cluster.md
@@ -0,0 +1,74 @@
# Start Google Cloud GKE Cluster with GPUs for KubeRay

## Step 1: Create a Kubernetes cluster on GKE

Run this command and all following commands on your local machine or in [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, you will need to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install). The following command creates a Kubernetes cluster named `kuberay-gpu-cluster` with 1 CPU node in the `us-west1-b` zone. In this example, we use the `e2-standard-4` machine type, which has 4 vCPUs and 16 GB RAM.

```sh
gcloud container clusters create kuberay-gpu-cluster \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-west1-b --machine-type e2-standard-4
```

> Note: You can also create a cluster from the [Google Cloud Console](https://console.cloud.google.com/kubernetes/list).

## Step 2: Create a GPU node pool

Run the following command to create a GPU node pool for Ray GPU workers.
(You can also create it from the Google Cloud Console; see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints#create_a_node_pool_with_node_taints) for more details.)

```sh
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-l4-vws,count=1 \
  --zone us-west1-b \
  --cluster kuberay-gpu-cluster \
  --num-nodes 1 \
  --min-nodes 0 \
  --max-nodes 1 \
  --enable-autoscaling \
  --machine-type g2-standard-4 \
  --node-taints=ray.io/node-type=worker:NoSchedule
```

The `--accelerator` flag specifies the type and number of GPUs for each node in the node pool. In this example, we use the [NVIDIA L4](https://cloud.google.com/compute/docs/gpus#l4-gpus) GPU. The `g2-standard-4` machine type has 1 GPU, 24 GB of GPU memory, 4 vCPUs, and 16 GB RAM.

The taint `ray.io/node-type=worker:NoSchedule` prevents CPU-only Pods, such as the KubeRay operator, Ray head, and CoreDNS Pods, from being scheduled on this GPU node pool. Because GPU nodes are expensive, we want to reserve this node pool for Ray GPU workers only.

Concretely, any Pod that does not have the following toleration will not be scheduled on this GPU node pool:

```yaml
tolerations:
- key: ray.io/node-type
  operator: Equal
  value: worker
  effect: NoSchedule
```

For more on taints and tolerations, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

## Step 3: Configure `kubectl` to connect to the cluster

Run the following command to download Google Cloud credentials and configure the Kubernetes CLI to use them.

```sh
gcloud container clusters get-credentials kuberay-gpu-cluster --zone us-west1-b
```

For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).
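Optionally, as a quick sanity check, list the nodes; both the CPU node from Step 1 and the GPU node from Step 2 should appear as `Ready`:

```sh
# Verify that kubectl now points at the new cluster
kubectl get nodes
```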
## Step 4: Install NVIDIA GPU device drivers

This step is required for GPU support on GKE. See the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) for more details.

```sh
# Install the NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

# Verify that your nodes have allocatable GPUs
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Example output:
# NAME                                          GPU
# gke-kuberay-gpu-cluster-gpu-node-pool-xxxxx   1
# gke-kuberay-gpu-cluster-default-pool-xxxxx    <none>
```

diff --git a/docs/guidance/stable-diffusion-rayservice.md b/docs/guidance/stable-diffusion-rayservice.md
index 2a4cd3c6f6..71b4a07acd 100644
--- a/docs/guidance/stable-diffusion-rayservice.md
+++ b/docs/guidance/stable-diffusion-rayservice.md
@@ -5,8 +5,7 @@ and [the Ray documentation](https://docs.ray.io/en/latest/serve/tutorials/stable
 
 ## Step 1: Create a Kubernetes cluster with GPUs
 
-Follow [aws-eks-gpu-cluster.md](./aws-eks-gpu-cluster.md) to create an AWS EKS cluster with 1
-CPU (`m5.xlarge`) node and 1 GPU (`g5.xlarge`) node.
+Follow [aws-eks-gpu-cluster.md](./aws-eks-gpu-cluster.md) or [gcp-gke-gpu-cluster.md](./gcp-gke-gpu-cluster.md) to create a Kubernetes cluster with 1 CPU node and 1 GPU node.
 
 ## Step 2: Install the nightly KubeRay operator
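As an optional follow-up to Step 4 above, you can smoke-test the driver installation end to end by scheduling a one-off GPU Pod on the new node pool. The manifest below is a minimal sketch, not part of the guides above: the file name `gpu-test.yaml` is hypothetical, and it uses the standard Kubernetes CUDA vector-add sample image (assuming that sample is compatible with your GPU). It must carry the toleration from Step 2 to land on the tainted node pool.

```yaml
# gpu-test.yaml (hypothetical): run one CUDA job on the tainted GPU node pool
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: registry.k8s.io/cuda-vector-add:v0.1  # standard Kubernetes GPU sample
    resources:
      limits:
        nvidia.com/gpu: 1  # request the node's single L4 GPU
  tolerations:
  - key: ray.io/node-type  # matches the taint applied in Step 2
    operator: Equal
    value: worker
    effect: NoSchedule
```

Apply it with `kubectl apply -f gpu-test.yaml`; if the driver installed correctly, `kubectl logs cuda-vector-add` should eventually print `Test PASSED`.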