[Doc][KubeRay] Update PyTorch Mnist Training doc for KubeRay 1.2.0 (#47321)

Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness authored Sep 4, 2024
1 parent d81f4d8 commit cefedcf
Showing 1 changed file with 14 additions and 13 deletions.
27 changes: 14 additions & 13 deletions doc/source/cluster/kubernetes/examples/mnist-training-example.md
@@ -26,8 +26,9 @@ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operat
```

You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
-* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, requires 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2 to allow all Pods to reach the `Running` status.
+* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod requires 3 CPUs, and the head Pod requires 1 CPU, as described in the `template` field. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2 to allow all Pods to reach the `Running` status.
* `NUM_WORKERS` under `runtimeEnvYAML` in `spec`: This field indicates the number of Ray actors to launch (see `ScalingConfig` in this [Document](ray-train-configs-api) for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster. Therefore, `NUM_WORKERS` must be less than or equal to `replicas`.
+* `CPUS_PER_WORKER`: This must be set to less than or equal to `(CPU resource request per worker Pod) - 1`. For example, in the sample YAML file, the CPU resource request per worker Pod is 3 CPUs, so `CPUS_PER_WORKER` must be set to 2 or less.

```sh
# `replicas` and `NUM_WORKERS` set to 2.
@@ -37,12 +38,12 @@ kubectl apply -f ray-job.pytorch-mnist.yaml
# Check existing Pods: According to `replicas`, there should be 2 worker Pods.
# Make sure all the Pods are in the `Running` status.
kubectl get pods
-# NAME                                                      READY   STATUS    RESTARTS   AGE
-# kuberay-operator-6dddd689fb-ksmcs                         1/1     Running   0          6m8s
-# pytorch-mnist-raycluster-rkdmq-worker-small-group-c8bwx   1/1     Running   0          5m32s
-# pytorch-mnist-raycluster-rkdmq-worker-small-group-s7wvm   1/1     Running   0          5m32s
-# rayjob-pytorch-mnist-nxmj2                                1/1     Running   0          4m17s
-# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl          1/1     Running   0          5m32s
+# NAME                                                             READY   STATUS    RESTARTS   AGE
+# kuberay-operator-6dddd689fb-ksmcs                                1/1     Running   0          6m8s
+# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx   1/1     Running   0          5m32s
+# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm   1/1     Running   0          5m32s
+# rayjob-pytorch-mnist-nxmj2                                       1/1     Running   0          4m17s
+# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl                 1/1     Running   0          5m32s
```
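
The CPU budget behind `replicas`, `NUM_WORKERS`, and `CPUS_PER_WORKER` can be checked with a little arithmetic. The sketch below is not part of this commit or of KubeRay. The function names are hypothetical, the default CPU requests are the ones quoted in the updated doc text (3 per worker Pod, 1 for the head Pod, 1 for the submitter Pod), and it assumes nothing else on the node (for example the KubeRay operator or system Pods) needs CPU.

```python
# Hypothetical helpers; the defaults mirror the CPU requests quoted in the doc text.

def max_worker_replicas(total_cpus, worker_cpus=3, head_cpus=1, submitter_cpus=1):
    """Largest `replicas` value whose Pods all fit in `total_cpus`."""
    # The head Pod and the RayJob submitter Pod are scheduled regardless of `replicas`.
    remaining = total_cpus - head_cpus - submitter_cpus
    return max(remaining // worker_cpus, 0)


def check_train_settings(replicas, num_workers, cpus_per_worker, worker_cpus=3):
    """Return the violated constraints (an empty list means the settings are consistent)."""
    problems = []
    if num_workers > replicas:
        problems.append("NUM_WORKERS must be <= replicas")
    if cpus_per_worker > worker_cpus - 1:
        problems.append("CPUS_PER_WORKER must be <= worker CPU request - 1")
    return problems


print(max_worker_replicas(8))  # 2, matching the 8-CPU example in the doc
print(check_train_settings(replicas=2, num_workers=2, cpus_per_worker=2))  # []
```

If some Pods stay `Pending` rather than `Running`, rerunning this calculation with the machine's actual CPU count is one quick way to see whether `replicas` is set too high.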

Check that the RayJob is in the `RUNNING` status:
@@ -68,12 +69,12 @@ After seeing `JOB_STATUS` marked as `SUCCEEDED`, you can check the training logs
```sh
# Check Pods name.
kubectl get pods
-# NAME                                                      READY   STATUS      RESTARTS   AGE
-# kuberay-operator-6dddd689fb-ksmcs                         1/1     Running     0          113m
-# pytorch-mnist-raycluster-rkdmq-worker-small-group-c8bwx   1/1     Running     0          38m
-# pytorch-mnist-raycluster-rkdmq-worker-small-group-s7wvm   1/1     Running     0          38m
-# rayjob-pytorch-mnist-nxmj2                                0/1     Completed   0          38m
-# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl          1/1     Running     0          38m
+# NAME                                                             READY   STATUS      RESTARTS   AGE
+# kuberay-operator-6dddd689fb-ksmcs                                1/1     Running     0          113m
+# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx   1/1     Running     0          38m
+# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm   1/1     Running     0          38m
+# rayjob-pytorch-mnist-nxmj2                                       0/1     Completed   0          38m
+# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl                 1/1     Running     0          38m

# Check training logs.
kubectl logs -f rayjob-pytorch-mnist-nxmj2