kuberay int with MCAD #598

Merged
8 commits merged on Sep 28, 2022
95 changes: 95 additions & 0 deletions docs/deploy/kuberay-with-MCAD.md
@@ -0,0 +1,95 @@
# KubeRay integration with MCAD (Multi-Cluster-App-Dispatcher)

The multi-cluster-app-dispatcher (MCAD) is a Kubernetes controller that provides mechanisms for applications to manage batch jobs in a single- or multi-cluster environment. For more details, please refer to the [multi-cluster-app-dispatcher repository](https://github.com/IBM/multi-cluster-app-dispatcher).

## Use case

MCAD allows you to deploy a Ray cluster with a guarantee that sufficient resources are available in the Kubernetes cluster before the pods are actually created. It supports features such as:

- Integrates with the upstream Kubernetes scheduling stack for features such as co-scheduling, packing on the GPU dimension, etc.
- Ability to wrap any Kubernetes objects.
- Increases control plane stability through JIT (Just-in-Time) object creation.
- Queuing with policies.
- Quota management that spans namespaces.
- Dispatching jobs to any one of the Kubernetes clusters.


To queue Ray cluster(s) and `gang dispatch` them when the aggregated resources become available, please follow the setup in [Kuberay-MCAD integration](https://github.com/IBM/multi-cluster-app-dispatcher/blob/quota-management/doc/usage/examples/kuberay/kuberay-mcad.md), with the configuration files available [here](https://github.com/IBM/multi-cluster-app-dispatcher/tree/quota-management/doc/usage/examples/kuberay/config).

## Submitting a KubeRay cluster to MCAD

Let's submit two Ray clusters to the same Kubernetes cluster.
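
Before submitting, it can help to sanity-check that the prerequisites from the setup guide above are actually in place. The commands below are a rough sketch; the grep patterns are assumptions about how the MCAD controller and KubeRay operator are typically named, so adjust them to match your installation.

```
# Sanity check (sketch): confirm the MCAD controller and the KubeRay operator
# are running and that their CRDs are registered. Names are illustrative and
# may differ depending on how the controllers were installed.
kubectl get pods -A | grep -i -E 'mcad|multi-cluster-app-dispatcher'
kubectl get pods -A | grep -i kuberay
kubectl get crd | grep -E 'appwrappers|rayclusters'
```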

- Assuming you have installed all the prerequisites mentioned in the [Kuberay-MCAD integration](https://github.com/IBM/multi-cluster-app-dispatcher/blob/quota-management/doc/usage/examples/kuberay/kuberay-mcad.md) guide, submit the first Ray cluster with the command `kubectl create -f aw-raycluster.yaml`, using the config file [here](https://github.com/IBM/multi-cluster-app-dispatcher/blob/quota-management/doc/usage/examples/kuberay/config/aw-raycluster.yaml) (see the sketch below).
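
  A minimal sketch of this submit-and-inspect step is shown below. It assumes the AppWrapper in the example config is named `raycluster-autoscaler` and that the MCAD CRD exposes the usual `appwrapper` resource name.

```
# Submit the AppWrapper that wraps the first Ray cluster, then inspect its
# status and the pods it created.
kubectl create -f aw-raycluster.yaml
kubectl describe appwrapper raycluster-autoscaler
kubectl get pods
```

  The AppWrapper status and pod listing for the first cluster should look similar to the following: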

```
Conditions:
Last Transition Micro Time: 2022-09-27T21:07:34.252275Z
Last Update Micro Time: 2022-09-27T21:07:34.252273Z
Status: True
Type: Init
Last Transition Micro Time: 2022-09-27T21:07:34.252535Z
Last Update Micro Time: 2022-09-27T21:07:34.252534Z
Reason: AwaitingHeadOfLine
Status: True
Type: Queueing
Last Transition Micro Time: 2022-09-27T21:07:34.261174Z
Last Update Micro Time: 2022-09-27T21:07:34.261174Z
Reason: FrontOfQueue.
Status: True
Type: HeadOfLine
Last Transition Micro Time: 2022-09-27T21:07:34.316208Z
Last Update Micro Time: 2022-09-27T21:07:34.316208Z
Reason: AppWrapperRunnable
Status: True
Type: Dispatched
Controllerfirsttimestamp: 2022-09-27T21:07:34.251877Z
Filterignore: true
Queuejobstate: Dispatched
Sender: before manageQueueJob - afterEtcdDispatching
State: Running
Events: <none>
(base) asmalvan@mcad-dev:~/mcad-kuberay$ kubectl get pods
NAME READY STATUS RESTARTS AGE
raycluster-autoscaler-1-head-9s4x5 2/2 Running 0 47s
raycluster-autoscaler-1-worker-small-group-4s6jv 1/1 Running 0 47s
```
- As seen above, the cluster is dispatched and the pods are running.

- Let's submit another Ray cluster with `kubectl create -f aw-raycluster.yaml` and see it queued without creating pending pods. Note: please change the cluster name from `name: raycluster-autoscaler` to `name: raycluster-autoscaler-1` before re-submitting (see the sketch below).
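
  One way to perform the rename and re-submission in a single step is sketched below. The `sed` pattern assumes that `raycluster-autoscaler` appears in `aw-raycluster.yaml` only where it is used as a name, so review the generated copy before creating it.

```
# Make a renamed copy of the example AppWrapper and submit it as a second
# cluster. This is a blunt global rename; inspect aw-raycluster-1.yaml first
# if your config uses the string elsewhere.
sed 's/raycluster-autoscaler/raycluster-autoscaler-1/g' aw-raycluster.yaml > aw-raycluster-1.yaml
kubectl create -f aw-raycluster-1.yaml
kubectl describe appwrapper raycluster-autoscaler-1
```

  The second AppWrapper's status then reports that it is queued, similar to the following: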

```
Conditions:
Last Transition Micro Time: 2022-09-27T21:11:06.162080Z
Last Update Micro Time: 2022-09-27T21:11:06.162080Z
Status: True
Type: Init
Last Transition Micro Time: 2022-09-27T21:11:06.162401Z
Last Update Micro Time: 2022-09-27T21:11:06.162401Z
Reason: AwaitingHeadOfLine
Status: True
Type: Queueing
Last Transition Micro Time: 2022-09-27T21:11:06.171619Z
Last Update Micro Time: 2022-09-27T21:11:06.171618Z
Reason: FrontOfQueue.
Status: True
Type: HeadOfLine
Last Transition Micro Time: 2022-09-27T21:11:06.179694Z
Last Update Micro Time: 2022-09-27T21:11:06.179689Z
Message: Insufficient resources to dispatch AppWrapper.
Reason: AppWrapperNotRunnable.
Status: True
Type: Backoff
Controllerfirsttimestamp: 2022-09-27T21:11:06.161797Z
Filterignore: true
Queuejobstate: HeadOfLine
Sender: before ScheduleNext - setHOL
State: Pending
Events: <none>
```


- As seen above, the second Ray cluster is queued and no pending pods are created.

- The out-of-the-box dispatching policy is FIFO, which can be augmented to suit user needs. The second cluster will be dispatched when additional aggregated resources become available in the cluster or when the first AppWrapper Ray cluster is deleted (see the sketch below).
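
  As a sketch of that last point, the queue can be observed and resources freed as follows. This assumes the first AppWrapper is named `raycluster-autoscaler`, as in the example config, and that deleting an AppWrapper also removes the wrapped RayCluster and its pods so that MCAD can dispatch the next AppWrapper in the queue.

```
# Watch the AppWrapper queue, then delete the first cluster to free resources
# for the queued one.
kubectl get appwrappers
kubectl delete appwrapper raycluster-autoscaler
kubectl get appwrappers -w
```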

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -28,6 +28,7 @@ nav:
- KubeRay Operator: components/operator.md
- KubeRay ApiServer: components/apiserver.md
- KubeRay CLI: components/cli.md
- KubeRay with MCAD: deploy/kuberay-with-MCAD.md
- Features:
- RayService: guidance/rayservice.md
- RayJob: guidance/rayjob.md