Commit 38e4643 ("ray integration") — committed by kevin85421 on Feb 28, 2023 (1 parent: 75b2509).

Showing 10 changed files with 36,014 additions and 0 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/ray_kind_test.yaml
@@ -0,0 +1,26 @@
name: Build & Apply Ray manifest in KinD
on:
  pull_request:
    paths:
      - contrib/ray/**

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Install KinD
        run: ./tests/gh-actions/install_kind.sh

      - name: Create KinD Cluster
        run: kind create cluster --image=kindest/node:v1.23.0

      - name: Install kustomize
        run: ./tests/gh-actions/install_kustomize.sh

      - name: Build & Apply manifests
        run: |
          cd contrib/ray/
          make test
12 changes: 12 additions & 0 deletions contrib/ray/Makefile
@@ -0,0 +1,12 @@
KUBERAY_RELEASE_VERSION ?= 0.4.0
KUBERAY_HELM_CHART_REPO ?= https://ray-project.github.io/kuberay-helm/

.PHONY: kuberay-operator/base
kuberay-operator/base:
	mkdir -p kuberay-operator/base
	cd kuberay-operator/base && helm template --include-crds kuberay-operator kuberay-operator --version ${KUBERAY_RELEASE_VERSION} --repo ${KUBERAY_HELM_CHART_REPO} > resources.yaml

.PHONY: test
test:
	./test.sh

5 changes: 5 additions & 0 deletions contrib/ray/OWNERS
@@ -0,0 +1,5 @@
approvers:
  - juliusvonkohout
reviewers:
  - juliusvonkohout
  - kimwnasptd
122 changes: 122 additions & 0 deletions contrib/ray/README.md
@@ -0,0 +1,122 @@
> Credit: This manifest draws heavily on the Google Cloud engineering blog post ["Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine"](https://cloud.google.com/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke).
# Ray
[Ray](https://github.com/ray-project/ray) is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute.

<figure>
<img
src="assets/map-of-ray.png"
alt="Ray">
<figcaption>Stack of Ray libraries - unified toolkit for ML workloads. (ref: https://docs.ray.io/en/latest/ray-overview/index.html)</figcaption>
</figure>

# KubeRay
[KubeRay](https://github.com/ray-project/kuberay) is an open-source Kubernetes operator for Ray. It provides several CRDs to simplify managing Ray clusters on Kubernetes. We will integrate Kubeflow and KubeRay in this document.

# Requirements
* Dependencies
  * `kustomize`: v3.2.0 (the Kubeflow manifests are sensitive to the `kustomize` version)
  * `Kubernetes`: v1.23

* Computing resources:
  * 16GB RAM
  * 8 CPUs

# Example
![image](assets/architecture.png)

## Step 1: Install Kubeflow v1.6-branch
* This example installs Kubeflow with the [v1.6-branch](https://github.com/kubeflow/manifests/tree/v1.6-branch).

* Install all Kubeflow official components and all common services using [one command](https://github.com/kubeflow/manifests/tree/v1.6-branch#install-with-a-single-command).
  * If you do not want to install all components, you can comment out **KNative**, **Katib**, **Tensorboards Controller**, **Tensorboard Web App**, **Training Operator**, and **KServe** from [example/kustomization.yaml](https://github.com/kubeflow/manifests/blob/v1.6-branch/example/kustomization.yaml).

## Step 2: Install KubeRay operator
```sh
# Install a KubeRay operator and custom resource definitions.
kustomize build kuberay-operator/base | kubectl apply --server-side -f -

# Check KubeRay operator
kubectl get pod -l app.kubernetes.io/component=kuberay-operator
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-5b8cd69758-rkpvh 1/1 Running 0 6m23s
```

## Step 3: Install RayCluster
```sh
# Create a RayCluster CR, and the KubeRay operator will reconcile a Ray cluster
# with 1 head Pod and 1 worker Pod.
kubectl apply -f raycluster_example.yaml

# Check RayCluster
kubectl get pod -l ray.io/cluster=kubeflow-raycluster
# NAME READY STATUS RESTARTS AGE
# kubeflow-raycluster-head-p6dpk 1/1 Running 0 70s
# kubeflow-raycluster-worker-small-group-l7j6c 1/1 Running 0 70s
```
* `raycluster_example.yaml` uses `rayproject/ray:2.2.0-py38-cpu` as its OCI image. Ray requires the Python version (MAJOR.MINOR) and the Ray version to match between the server (RayCluster) and the client (Jupyter notebook). This image uses:
  * Python 3.8.13
  * Ray 2.2.0
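
The version constraint can be sanity-checked on the client before connecting. A minimal sketch — the helper function and the hard-coded server versions are illustrative (taken from the `rayproject/ray:2.2.0-py38-cpu` image above), not part of the manifest:

```python
# Expected server-side versions, baked into rayproject/ray:2.2.0-py38-cpu.
SERVER_PYTHON = (3, 8)   # only MAJOR.MINOR must match
SERVER_RAY = "2.2.0"     # the Ray version must match exactly

def versions_compatible(client_python, client_ray):
    """Return True when the client's Python MAJOR.MINOR and Ray version match the server."""
    return tuple(client_python[:2]) == SERVER_PYTHON and client_ray == SERVER_RAY

print(versions_compatible((3, 8, 10), "2.2.0"))  # True: Python 3.8 + Ray 2.2.0
print(versions_compatible((3, 9, 16), "2.2.0"))  # False: Python 3.9 != 3.8
```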

## Step 4: Forward the port of Istio's Ingress-Gateway
* Follow the [instructions](https://github.com/kubeflow/manifests/tree/v1.6-branch#port-forward) to forward the port of Istio's Ingress-Gateway and log in to Kubeflow Central Dashboard.

## Step 5: Create a Jupyter notebook via Kubeflow Central Dashboard
* Click the "Notebooks" icon in the left panel.
* Click "New Notebook".
* Select `kubeflownotebookswg/jupyter-scipy:v1.6.1` as the OCI image.
* Click "Launch".
* Click "CONNECT" to connect to the notebook instance.

## Step 6: Use Ray client in the Jupyter notebook to connect to the RayCluster
* As noted in Step 3, Ray requires the Python version (MAJOR.MINOR) and the Ray version to match between the server (RayCluster) and the client (Jupyter notebook).
```sh
# Check the Python version. MAJOR.MINOR must match the RayCluster (i.e. Python 3.8).
python --version
# Python 3.8.10

# Install Ray 2.2.0 (quote the extras so the shell does not expand the brackets).
pip install -U "ray[default]==2.2.0"
```
* Connect to RayCluster via Ray client.
```python
# Open a new .ipynb page and run the following.
import ray

# Address format: ray://${RAYCLUSTER_HEAD_SVC}.${NAMESPACE}.svc.cluster.local:${RAY_CLIENT_PORT}
ray.init(address="ray://kubeflow-raycluster-head-svc.default.svc.cluster.local:10001")
print(ray.cluster_resources())
# {'node:10.244.0.41': 1.0, 'memory': 3000000000.0, 'node:10.244.0.40': 1.0, 'object_store_memory': 805386239.0, 'CPU': 2.0}

# Try a Ray task.
@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

# Try a Ray actor.
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

counters = [Counter.remote() for i in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures))  # [1, 1, 1, 1]
```
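
Each `Counter.remote()` call above creates an independent actor with its own state, so one `increment` per actor yields `[1, 1, 1, 1]`. The same semantics can be checked without a cluster using a plain-Python stand-in (illustrative only — `LocalCounter` is not a Ray API):

```python
class LocalCounter:
    """Plain-Python stand-in for the Ray actor: state lives per instance."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

counters = [LocalCounter() for _ in range(4)]  # four independent "actors"
for c in counters:
    c.increment()                              # one increment per actor
print([c.read() for c in counters])            # [1, 1, 1, 1]
```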
# Upgrading
```sh
# Step 1: Update KUBERAY_RELEASE_VERSION in Makefile
# Step 2: Create new KubeRay operator manifest
make kuberay-operator/base
```
Binary file added contrib/ray/assets/architecture.png
Binary file added contrib/ray/assets/map-of-ray.png
19 changes: 19 additions & 0 deletions contrib/ray/kuberay-operator/base/kustomization.yaml
@@ -0,0 +1,19 @@
patches:
  # Add securityContext to the KubeRay operator Pod.
  - target:
      kind: Deployment
      labelSelector: "app.kubernetes.io/name=kuberay-operator"
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/securityContext
        value:
          runAsUser: 1000
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
resources:
  - resources.yaml