Kuberay integration #2383

Merged: 18 commits, May 12, 2023
26 changes: 26 additions & 0 deletions .github/workflows/ray_kind_test.yaml
@@ -0,0 +1,26 @@
name: Build & Apply Ray manifest in KinD
on:
  pull_request:
    paths:
      - contrib/ray/**

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Install KinD
        run: ./tests/gh-actions/install_kind.sh

      - name: Create KinD Cluster
        run: kind create cluster --image=kindest/node:v1.23.0

      - name: Install kustomize
        run: ./tests/gh-actions/install_kustomize.sh

      - name: Build & Apply manifests
        run: |
          cd contrib/ray/
          make test
12 changes: 12 additions & 0 deletions contrib/ray/Makefile
@@ -0,0 +1,12 @@
KUBERAY_RELEASE_VERSION ?= 0.4.0
KUBERAY_HELM_CHART_REPO ?= https://ray-project.github.io/kuberay-helm/

.PHONY: kuberay-operator/base
kuberay-operator/base:
	mkdir -p kuberay-operator/base
	cd kuberay-operator/base && helm template --include-crds kuberay-operator kuberay-operator --version ${KUBERAY_RELEASE_VERSION} --repo ${KUBERAY_HELM_CHART_REPO} > resources.yaml

.PHONY: test
test:
	./test.sh

5 changes: 5 additions & 0 deletions contrib/ray/OWNERS
@@ -0,0 +1,5 @@
approvers:
- juliusvonkohout
reviewers:
- juliusvonkohout
- kimwnasptd
135 changes: 135 additions & 0 deletions contrib/ray/README.md
@@ -0,0 +1,135 @@
TODO
- The Ray dashboard, head, and worker must only be reachable from inside your Kubeflow user namespace.
- Re-enable the Istio sidecar for the Ray head and worker in the user namespace and provide the corresponding Istio AuthorizationPolicies (see the sketch below). We can keep the Istio sidecar for the kuberay-operator Deployment in the kubeflow namespace, since it does NOT use a webhook, but another mechanism, to reconcile RayClusters. This also means we do not need a NetworkPolicy for the Ray operator.
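A minimal sketch of the kind of AuthorizationPolicy this TODO refers to, assuming the cluster Pods carry the `ray.io/cluster: kubeflow-raycluster` label and the user namespace is `development` (both values are illustrative, not part of this PR):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ray-cluster-own-namespace-only  # hypothetical name
  namespace: development                # illustrative Kubeflow user namespace
spec:
  selector:
    matchLabels:
      ray.io/cluster: kubeflow-raycluster  # label KubeRay sets on the cluster's Pods
  action: ALLOW
  rules:
    - from:
        - source:
            # Only allow traffic that originates inside the same namespace.
            namespaces: ["development"]
```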


> Credit: These manifests draw heavily on the Google Cloud engineering blog post ["Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine"](https://cloud.google.com/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke).

# Ray
[Ray](https://github.com/ray-project/ray) is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute.

<figure>
<img
src="assets/map-of-ray.png"
alt="Ray">
<figcaption>Stack of Ray libraries - unified toolkit for ML workloads. (ref: https://docs.ray.io/en/latest/ray-overview/index.html)</figcaption>
</figure>

# KubeRay
[KubeRay](https://github.com/ray-project/kuberay) is an open-source Kubernetes operator for Ray. It provides several CRDs to simplify managing Ray clusters on Kubernetes. We will integrate Kubeflow and KubeRay in this document.
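For example, once the operator is installed (see Step 2 below), the KubeRay CRDs should be registered in the cluster. The exact set depends on the KubeRay release; for 0.4.0 it should include at least the following:

```sh
# List the Ray CRDs registered by the KubeRay operator.
kubectl get crd | grep ray.io
# rayclusters.ray.io    ...
# rayjobs.ray.io        ...
# rayservices.ray.io    ...
```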

# Requirements
* Dependencies
* `kustomize`: v3.2.0 (the Kubeflow manifests are sensitive to the `kustomize` version.)
* `Kubernetes`: v1.23

* Computing resources:
* 16GB RAM
* 8 CPUs

# Example
<figure>
<img
src="assets/architecture.svg"
alt="ray/kubeflow integration">
<figcaption>Note: (1) Kubeflow Central Dashboard will be renamed to workbench in the future. (2) Kubeflow Pipeline (KFP) is an important component of Kubeflow, but it is not included in this example.</figcaption>
</figure>

## Step 1: Install Kubeflow v1.7-branch
* This example installs Kubeflow with the [v1.7-branch](https://github.com/kubeflow/manifests/tree/v1.7-branch).

* Install all Kubeflow official components and all common services using [one command](https://github.com/kubeflow/manifests/tree/v1.7-branch#install-with-a-single-command).
* If you do not want to install all components, you can comment out **KNative**, **Katib**, **Tensorboards Controller**, **Tensorboard Web App**, **Training Operator**, and **KServe** from [example/kustomization.yaml](https://github.com/kubeflow/manifests/blob/v1.7-branch/example/kustomization.yaml).

## Step 2: Install KubeRay operator

In keeping with Kubernetes standards, we never install the Ray operator into the "default" namespace, but into a proper one, in our case "kubeflow".

```sh
# Install a KubeRay operator and custom resource definitions.
kustomize build kuberay-operator/base | kubectl apply --server-side -f -

# Check KubeRay operator
kubectl get pod -l app.kubernetes.io/component=kuberay-operator
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-5b8cd69758-rkpvh 1/1 Running 0 6m23s
```

## Step 3: Install RayCluster
```sh
# Create a RayCluster CR, and the KubeRay operator will reconcile a Ray cluster
# with 1 head Pod and 1 worker Pod.
# $MY_KUBEFLOW_USER_NAMESPACE must be a proper Kubeflow user namespace with Istio
# sidecar injection enabled, and never the wrong "default" namespace.
export MY_KUBEFLOW_USER_NAMESPACE=development
kubectl apply -f raycluster_example.yaml -n $MY_KUBEFLOW_USER_NAMESPACE

# Check RayCluster
kubectl get pod -l ray.io/cluster=kubeflow-raycluster -n $MY_KUBEFLOW_USER_NAMESPACE
# NAME READY STATUS RESTARTS AGE
# kubeflow-raycluster-head-p6dpk 1/1 Running 0 70s
# kubeflow-raycluster-worker-small-group-l7j6c 1/1 Running 0 70s
```
* `raycluster_example.yaml` uses `rayproject/ray:2.2.0-py38-cpu` as its OCI image (a sketch of such a manifest follows this list). Ray is very sensitive to mismatched Python and Ray versions between the server (RayCluster) and client (JupyterLab) sides. This image uses:
* Python 3.8.13
* Ray 2.2.0
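
For orientation, here is a rough sketch of what a manifest along these lines could look like, based on the KubeRay 0.4.0 `RayCluster` schema; the group name, ports, and parameters below are illustrative, and the actual `raycluster_example.yaml` may differ:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: kubeflow-raycluster
spec:
  rayVersion: '2.2.0'  # should match the Ray version baked into the image
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.2.0-py38-cpu
            ports:
              - containerPort: 10001  # Ray client port used in Step 6
                name: client
              - containerPort: 8265   # Ray dashboard
                name: dashboard
  workerGroupSpecs:
    - groupName: small-group  # matches the worker Pod name shown above
      replicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.2.0-py38-cpu
```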

## Step 4: Forward the port of Istio's Ingress-Gateway
* Follow the [instructions](https://github.com/kubeflow/manifests/tree/v1.7-branch#port-forward) to forward the port of Istio's Ingress-Gateway and log in to Kubeflow Central Dashboard.
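
In practice, the linked instructions come down to a single `kubectl port-forward` against the Istio ingress gateway; the command below reflects the v1.7-branch README, so double-check it there:

```sh
# Forward local port 8080 to the Istio ingress gateway, then open http://localhost:8080.
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```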

## Step 5: Create a JupyterLab via Kubeflow Central Dashboard
* Click "Notebooks" icon in the left panel.
* Click "New Notebook"
* Select `kubeflownotebookswg/jupyter-scipy:v1.7.0` as OCI image.
* Click "Launch"
* Click "CONNECT" to connect into the JupyterLab instance.

## Step 6: Use Ray client in the JupyterLab to connect to the RayCluster
* As mentioned in Step 3, Ray is very sensitive to mismatched Python and Ray versions between the server (RayCluster) and client (JupyterLab) sides.
```sh
# Check the Python version. Its MAJOR and MINOR versions should match the RayCluster's (i.e. Python 3.8).
python --version
# Python 3.8.10

# Install Ray 2.2.0
pip install -U ray[default]==2.2.0
```
* Connect to RayCluster via Ray client.
```python
# Open a new .ipynb page.

import ray
# For RayClusters in other namespaces, use ray://${RAYCLUSTER_HEAD_SVC}.${NAMESPACE}.svc.cluster.local:${RAY_CLIENT_PORT}.
# Here we of course use our per-namespace Ray cluster to get multi-tenancy, and we
# never use "default" as the namespace, since that would violate Kubernetes standards.
ray.init(address="ray://kubeflow-raycluster-head-svc:10001")
print(ray.cluster_resources())
# {'node:10.244.0.41': 1.0, 'memory': 3000000000.0, 'node:10.244.0.40': 1.0, 'object_store_memory': 805386239.0, 'CPU': 2.0}

# Try Ray task
@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures)) # [0, 1, 4, 9]

# Try Ray actor
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

counters = [Counter.remote() for i in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures)) # [1, 1, 1, 1]
```
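* When you are done experimenting, the client connection can be closed explicitly; `ray.shutdown()` is the standard Ray call for this (not part of the original example):
```python
# Disconnect the Ray client from the RayCluster.
ray.shutdown()
```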

# Upgrading
See [UPGRADE.md](UPGRADE.md) for more details.
6 changes: 6 additions & 0 deletions contrib/ray/UPGRADE.md
@@ -0,0 +1,6 @@
# Upgrading
```sh
# Step 1: Update KUBERAY_RELEASE_VERSION in Makefile
# Step 2: Create new KubeRay operator manifest
make kuberay-operator/base
```
1 change: 1 addition & 0 deletions contrib/ray/assets/architecture.svg
Binary file added contrib/ray/assets/map-of-ray.png
20 changes: 20 additions & 0 deletions contrib/ray/kuberay-operator/base/kustomization.yaml
@@ -0,0 +1,20 @@
patches:
  # Add securityContext to KubeRay operator Pod.
  - target:
      kind: Deployment
      labelSelector: "app.kubernetes.io/name=kuberay-operator"
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/securityContext
        value:
          runAsUser: 1000
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
namespace: kubeflow
resources:
  - namespace.yaml
  - resources.yaml
4 changes: 4 additions & 0 deletions contrib/ray/kuberay-operator/base/namespace.yaml
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow