[Doc][KubeRay] Update PyTorch Mnist Training doc for KubeRay 1.2.0 #47321

Merged
33 changes: 17 additions & 16 deletions doc/source/cluster/kubernetes/examples/mnist-training-example.md
@@ -26,8 +26,9 @@ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operat
```

You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, requires 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2 to allow all Pods to reach the `Running` status.
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod requires 3 CPUs, and the head Pod requires 1 CPU, as described in the `template` field. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2 to allow all Pods to reach the `Running` status.
* `NUM_WORKERS` under `runtimeEnvYAML` in `spec`: This field indicates the number of Ray actors to launch (see `ScalingConfig` in this [Document](ray-train-configs-api) for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster. Therefore, `NUM_WORKERS` must be less than or equal to `replicas`.
* `CPUS_PER_WORKER`: This must be set to less than or equal to `(CPU resource request per worker Pod) - 1`. For example, in the sample YAML file, the CPU resource request per worker Pod is 3 CPUs, so `CPUS_PER_WORKER` must be set to 2 or less (one way to inspect these fields is sketched below).
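
Before applying the manifest, you can double-check these three settings with the sketch below. It assumes the YAML file was downloaded as `ray-job.pytorch-mnist.yaml` into the current directory, as in the `curl -LO` step above.

```sh
# A quick sanity check, not part of the official workflow: print the lines that
# set the worker replica count and the runtime-environment variables.
grep -n -E "replicas:|NUM_WORKERS|CPUS_PER_WORKER" ray-job.pytorch-mnist.yaml

# Rough CPU budget with the sample requests: 1 CPU (head) + 1 CPU (submitter)
# + replicas * 3 CPUs (workers) must not exceed the CPUs available to your
# Kubernetes cluster, which caps `replicas` at 2 on an 8-CPU machine.
```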

```sh
# `replicas` and `NUM_WORKERS` set to 2.
@@ -37,12 +38,12 @@ kubectl apply -f ray-job.pytorch-mnist.yaml
# Check existing Pods: According to `replicas`, there should be 2 worker Pods.
# Make sure all the Pods are in the `Running` status.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6dddd689fb-ksmcs 1/1 Running 0 6m8s
# pytorch-mnist-raycluster-rkdmq-worker-small-group-c8bwx 1/1 Running 0 5m32s
# pytorch-mnist-raycluster-rkdmq-worker-small-group-s7wvm 1/1 Running 0 5m32s
# rayjob-pytorch-mnist-nxmj2 1/1 Running 0 4m17s
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 5m32s
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6dddd689fb-ksmcs 1/1 Running 0 6m8s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx 1/1 Running 0 5m32s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm 1/1 Running 0 5m32s
# rayjob-pytorch-mnist-nxmj2 1/1 Running 0 4m17s
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 5m32s
```
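
Optionally, instead of re-running `kubectl get pods`, you can block until the Ray Pods report `Ready`. This is a convenience sketch, not a step from the guide, and it assumes KubeRay's `ray.io/node-type` Pod label:

```sh
# Wait (up to 5 minutes) for the Ray head and worker Pods to become Ready.
# If the label selector matches nothing, fall back to plain `kubectl get pods`.
kubectl wait --for=condition=Ready pod -l ray.io/node-type --timeout=300s
```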

Check that the RayJob is in the `RUNNING` status:
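
The exact command block is collapsed in this diff. A typical check looks like the following sketch; the output shown is illustrative, and the printed columns vary with the KubeRay version:

```sh
# List RayJob custom resources and watch for JOB_STATUS to move from RUNNING
# to SUCCEEDED. The resource name matches the sample RayJob.
kubectl get rayjob
# NAME                   JOB_STATUS   DEPLOYMENT_STATUS   ...
# rayjob-pytorch-mnist   RUNNING      Running             ...
```
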
@@ -68,12 +69,12 @@ After seeing `JOB_STATUS` marked as `SUCCEEDED`, you can check the training logs
```sh
# Check Pods name.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6dddd689fb-ksmcs 1/1 Running 0 113m
# pytorch-mnist-raycluster-rkdmq-worker-small-group-c8bwx 1/1 Running 0 38m
# pytorch-mnist-raycluster-rkdmq-worker-small-group-s7wvm 1/1 Running 0 38m
# rayjob-pytorch-mnist-nxmj2 0/1 Completed 0 38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 38m
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6dddd689fb-ksmcs 1/1 Running 0 113m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx 1/1 Running 0 38m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm 1/1 Running 0 38m
# rayjob-pytorch-mnist-nxmj2 0/1 Completed 0 38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 38m

# Check training logs.
kubectl logs -f rayjob-pytorch-mnist-nxmj2
@@ -83,9 +84,9 @@ kubectl logs -f rayjob-pytorch-mnist-nxmj2
# 2024-06-16 22:23:01,844 SUCC cli.py:61 -- Job 'rayjob-pytorch-mnist-l6ccc' submitted successfully
# 2024-06-16 22:23:01,844 SUCC cli.py:62 -- -------------------------------------------------------
# ...
# (RayTrainWorker pid=1138, ip=10.244.0.18)
# (RayTrainWorker pid=1138, ip=10.244.0.18)
# 0%| | 0/26421880 [00:00<?, ?it/s]
# (RayTrainWorker pid=1138, ip=10.244.0.18)
# (RayTrainWorker pid=1138, ip=10.244.0.18)
# 0%| | 32768/26421880 [00:00<01:27, 301113.97it/s]
# ...
# Training finished iteration 10 at 2024-06-16 22:33:05. Total running time: 7min 9s
@@ -117,4 +118,4 @@ Delete your RayJob with the following command:

```sh
kubectl delete -f ray-job.pytorch-mnist.yaml
```
```