feat: roberta self-test
This PR adds one-off and periodic self-tests that run RoBERTa on 1 GPU, as a Kubernetes Job and CronJob respectively. No client-side support is needed beyond the one-time creation of these resources.

Also: make sure the self-test YAMLs reflect the latest image versions.
starpit committed Aug 28, 2022
1 parent e59bee2 commit f2fbfd2
Showing 14 changed files with 370 additions and 4 deletions.
3 changes: 2 additions & 1 deletion .dockerignore
@@ -1 +1,2 @@
-store/**/*.md
+store/**/*.md
+*~
2 changes: 1 addition & 1 deletion .github/workflows/self-test.yml
@@ -41,4 +41,4 @@ jobs:
           TERM: xterm
           DEBUG_KUBERNETES: true
           TEST_LOG_AGGREGATOR: true
-        run: ./deploy/self-test/run.sh
+        run: ./tests/self-test/run.sh
1 change: 1 addition & 0 deletions deploy/self-test/.gitignore
@@ -0,0 +1 @@
github.yaml
80 changes: 80 additions & 0 deletions deploy/self-test/roberta/1gpu/cron.yaml
@@ -0,0 +1,80 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: codeflare-self-test-serviceaccount
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: codeflare-self-test-role
rules:
- apiGroups: [""]
  resources: ["pods", "pods/exec", "services", "events", "secrets"]
  verbs: ["create", "delete", "get", "watch", "list"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["create", "delete", "get", "watch", "list"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods/exec", "pods/portforward"]
  verbs: ["create", "delete"]
#- apiGroups: ["apps"]
#  resources: [deployments]
#  verbs: [get, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: codeflare-self-test-rolebinding
subjects:
- kind: ServiceAccount
  name: codeflare-self-test-serviceaccount
roleRef:
  kind: Role
  name: codeflare-self-test-role
  apiGroup: rbac.authorization.k8s.io

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: codeflare-self-test-roberta-1gpu-periodic
spec:
  schedule: "0/30 * * * *" # every 30 minutes, starting from the top of the hour (see crontab.guru)
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1000
  successfulJobsHistoryLimit: 1000
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: codeflare-self-test-serviceaccount
          containers:
          - name: self-test
            image: ghcr.io/project-codeflare/codeflare-self-test:0.10.4
            env:
            # - name: GUIDEBOOK_RUN_ARGS
            #   value: "-V"
            - name: VARIANTS
              value: roberta-1gpu
            - name: ML_CODEFLARE_ROBERTA_GITHUB_USER
              valueFrom:
                secretKeyRef:
                  name: github
                  key: GITHUB_USER
            - name: ML_CODEFLARE_ROBERTA_GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: github
                  key: GITHUB_TOKEN
            - name: MODE
              value: development # otherwise building codeflare-cli takes a huge amount of memory
            - name: KUBE_CONTEXT_FOR_TEST
              value: kind-codeflare-test # must match with tests/kind/profiles/...
            - name: KUBE_NS_FOR_TEST
              value: default # must match with tests/kind/profiles/...
            - name: CODEFLARE_NAMESPACE_RESTRICTED # restrict use of cluster-scoped resources
              value: "true"
          restartPolicy: Never
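
The two CronJob manifests in this commit spell the schedule differently: `0/30` in cron.yaml and `*/30` in periodic.yaml. The sketch below is a rough illustration of why both should fire at minutes 0 and 30; it is not the parser Kubernetes uses. It assumes that `N/step` is read as "from minute N through 59, every `step` minutes", a non-standard shorthand that some cron implementations accept. `expand_minute_field` is a hypothetical helper name.

```python
from __future__ import annotations


def expand_minute_field(field: str) -> list[int]:
    """Expand the minute field of a cron expression into the minutes it matches."""
    lo, hi = 0, 59
    minutes: set[int] = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step <= 0:
            raise ValueError(f"step must be positive: {part!r}")
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = int(base)
            # Assumption: "N/step" means "N through 59, every step minutes".
            end = hi if step_text else start
        if not (lo <= start <= end <= hi):
            raise ValueError(f"minute out of range: {part!r}")
        minutes.update(range(start, end + 1, step))
    return sorted(minutes)


if __name__ == "__main__":
    for schedule in ("0/30 * * * *", "*/30 * * * *"):
        minute_field = schedule.split()[0]
        print(schedule, "->", expand_minute_field(minute_field))
```

Under that assumption the two spellings are interchangeable; they would differ only for a non-zero start such as `15/30`.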
77 changes: 77 additions & 0 deletions deploy/self-test/roberta/1gpu/once.yaml
@@ -0,0 +1,77 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: codeflare-self-test-serviceaccount
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: codeflare-self-test-role
rules:
- apiGroups: [""]
  resources: ["pods", "pods/exec", "services", "events", "secrets"]
  verbs: ["create", "delete", "get", "watch", "list"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["create", "delete", "get", "watch", "list"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods/exec", "pods/portforward"]
  verbs: ["create", "delete"]
#- apiGroups: ["apps"]
#  resources: [deployments]
#  verbs: [get, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: codeflare-self-test-rolebinding
subjects:
- kind: ServiceAccount
  name: codeflare-self-test-serviceaccount
roleRef:
  kind: Role
  name: codeflare-self-test-role
  apiGroup: rbac.authorization.k8s.io

---
apiVersion: batch/v1
kind: Job
metadata:
  name: codeflare-self-test-roberta-1gpu-once
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      serviceAccountName: codeflare-self-test-serviceaccount
      containers:
      - name: self-test
        image: ghcr.io/project-codeflare/codeflare-self-test:0.10.4
        env:
        # - name: GUIDEBOOK_RUN_ARGS
        #   value: "-V"
        - name: VARIANTS
          value: roberta-1gpu
        - name: ML_CODEFLARE_ROBERTA_GITHUB_USER
          valueFrom:
            secretKeyRef:
              name: github
              key: GITHUB_USER
        - name: ML_CODEFLARE_ROBERTA_GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github
              key: GITHUB_TOKEN
        - name: MODE
          value: development # otherwise building codeflare-cli takes a huge amount of memory
        - name: KUBE_CONTEXT_FOR_TEST
          value: kind-codeflare-test # must match with tests/kind/profiles/...
        - name: KUBE_NS_FOR_TEST
          value: default # must match with tests/kind/profiles/...
        - name: CODEFLARE_NAMESPACE_RESTRICTED # restrict use of cluster-scoped resources
          value: "true"
      restartPolicy: Never
  backoffLimit: 1
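
The Job above reads `GITHUB_USER` and `GITHUB_TOKEN` from a Secret named `github`, which this commit does not define (the ignored `github.yaml` may be where it lives, though that is a guess). A minimal sketch of generating such a Secret follows. The function name `github_secret_manifest`, the `default` namespace, and the placeholder credentials are assumptions for illustration only.

```python
import json


def github_secret_manifest(user: str, token: str, namespace: str = "default") -> dict:
    """Build a Secret manifest with the name and keys the self-test containers read."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "github", "namespace": namespace},
        "type": "Opaque",
        # stringData takes plain text; the API server stores it base64-encoded.
        "stringData": {"GITHUB_USER": user, "GITHUB_TOKEN": token},
    }


if __name__ == "__main__":
    # Placeholder values only; never commit real credentials.
    manifest = github_secret_manifest("<github-user>", "<github-token>")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` before creating the Job or CronJob.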
77 changes: 77 additions & 0 deletions deploy/self-test/roberta/1gpu/periodic.yaml
@@ -0,0 +1,77 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: codeflare-self-test-serviceaccount
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: codeflare-self-test-role
rules:
- apiGroups: [""]
  resources: ["pods", "pods/exec", "services", "events", "secrets"]
  verbs: ["create", "delete", "get", "watch", "list"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["create", "delete", "get", "watch", "list"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods/exec", "pods/portforward"]
  verbs: ["create", "delete"]
#- apiGroups: ["apps"]
#  resources: [deployments]
#  verbs: [get, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: codeflare-self-test-rolebinding
subjects:
- kind: ServiceAccount
  name: codeflare-self-test-serviceaccount
roleRef:
  kind: Role
  name: codeflare-self-test-role
  apiGroup: rbac.authorization.k8s.io

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: codeflare-self-test-roberta-1gpu
spec:
  schedule: "*/30 * * * *" # every 30 minutes see crontab.guru
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: codeflare-self-test-serviceaccount
          containers:
          - name: self-test
            image: ghcr.io/project-codeflare/codeflare-self-test:0.10.4
            env:
            # - name: GUIDEBOOK_RUN_ARGS
            #   value: "-V"
            - name: VARIANTS
              value: roberta-1gpu
            - name: ML_CODEFLARE_ROBERTA_GITHUB_USER
              valueFrom:
                secretKeyRef:
                  name: github
                  key: GITHUB_USER
            - name: ML_CODEFLARE_ROBERTA_GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: github
                  key: GITHUB_TOKEN
            - name: MODE
              value: development # otherwise building codeflare-cli takes a huge amount of memory
            - name: KUBE_CONTEXT_FOR_TEST
              value: kind-codeflare-test # must match with tests/kind/profiles/...
            - name: KUBE_NS_FOR_TEST
              value: default # must match with tests/kind/profiles/...
            - name: CODEFLARE_NAMESPACE_RESTRICTED # restrict use of cluster-scoped resources
              value: "true"
          restartPolicy: Never
2 changes: 2 additions & 0 deletions package.json
@@ -124,6 +124,8 @@
"infile": "CHANGELOG.md"
},
"@release-it/bumper": {
"out": [
"deploy/self-test/self-test.yaml",
"deploy/self-test/self-test-roberta.yaml",
"plugins/plugin-client-default/package.json"
]
}
}
25 changes: 25 additions & 0 deletions tests/kind/profiles/roberta-1gpu/keep-it-simple
@@ -0,0 +1,25 @@
{
"name": "keep-it-simple",
"creationTime": 1660657756574,
"lastModifiedTime": 1661642588298,
"lastUsedTime": 1661643221215,
"choices": {
"madwizard/apriori/use-gpu": "don't use gpus",
"madwizard/apriori/arch": "x64",
"madwizard/apriori/platform": "darwin",
"madwizard/apriori/mac-installer": "Homebrew",
"madwizard/apriori/in-terminal": "HTML",
"Start a new Run####Connect Dashboard to an existing Run####Boot up a Cloud Computer####Shut down a Cloud Computer": "Start a new Run",
"Run with CodeFlare Model Architecture####Bring Your Own Code####Demos": "Run with CodeFlare Model Architecture",
"Training Tasks####Fine Tuning Tasks": "Training Tasks",
"Train a Masked Language Model": "Train a Masked Language Model",
"I want to run a quick test with sample data####I have my own custom input data on S3": "I want to run a quick test with sample data",
"AWS####IBM####My data is not stored in S3": "My data is not stored in S3",
"Run Locally####Run on a Kubernetes Cluster": "Run on a Kubernetes Cluster",
"My Cluster is Running Locally####My Cluster is Running on Kubernetes": "My Cluster is Running on Kubernetes",
"expand((kubectl config get-contexts -o name | grep -E . >& /dev/null && kubectl config get-contexts -o name) || (kubectl version | grep Server >& /dev/null && echo \"${KUBE_CONTEXT_FOR_TEST-In-cluster}\" || exit 1), Kubernetes contexts)": "kind-codeflare-test",
"expand([ -z ${KUBE_CONTEXT} ] && exit 1 || X=$([ -n \"$KUBE_NS_FOR_TEST\" ] && echo $KUBE_NS_FOR_TEST || kubectl ${KUBE_CONTEXT_ARG} get ns -o name || oc ${KUBE_CONTEXT_ARG} get projects -o name); echo \"$X\" | sed -E 's#(namespace|project.project.openshift.io)/##' | grep -Ev 'openshift|kube-', Kubernetes namespaces)####Create a namespace": "default",
"Number of CPUs####Number of GPUs####Minimum Workers####Maximum Workers####Worker Memory####Head Memory": "{\"Number of CPUs\":\"1\",\"Number of GPUs\":\"1\",\"Minimum Workers\":\"1\",\"Maximum Workers\":\"1\",\"Worker Memory\":\"8Gi\",\"Head Memory\":\"8Gi\"}",
"Keep It Simple####Use the Ray Autoscaler####Use the Multi-user Enhanced Kubernetes Scheduler": "Keep It Simple"
}
}
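
Reading the profile above, each key under `choices` appears to be the list of options offered for one question, joined with `####`, and each value the option that was picked; the resource form stores its answer as a JSON object encoded inside a string. The sketch below decodes a trimmed copy of the profile on that assumption. `describe_choices` is a hypothetical helper, and the `####` interpretation is inferred from the file, not taken from madwizard documentation.

```python
import json

# Trimmed copy of the keep-it-simple profile; only two choices are kept.
PROFILE_TEXT = r'''
{
  "name": "keep-it-simple",
  "choices": {
    "Run Locally####Run on a Kubernetes Cluster": "Run on a Kubernetes Cluster",
    "Number of CPUs####Number of GPUs####Minimum Workers####Maximum Workers####Worker Memory####Head Memory": "{\"Number of CPUs\":\"1\",\"Number of GPUs\":\"1\",\"Minimum Workers\":\"1\",\"Maximum Workers\":\"1\",\"Worker Memory\":\"8Gi\",\"Head Memory\":\"8Gi\"}"
  }
}
'''


def describe_choices(profile: dict) -> dict:
    """Map each question (a tuple of its offered options) to the decoded answer."""
    decoded = {}
    for key, answer in profile["choices"].items():
        options = tuple(key.split("####"))
        # Form-style answers are themselves JSON objects encoded as strings.
        if answer.startswith("{"):
            answer = json.loads(answer)
        decoded[options] = answer
    return decoded


if __name__ == "__main__":
    profile = json.loads(PROFILE_TEXT)
    for options, answer in describe_choices(profile).items():
        print(options, "=>", answer)
```

Loading a profile with `json.loads` in this way also surfaces syntax slips, such as a missing comma between entries, before the self-test tries to use the file.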
26 changes: 26 additions & 0 deletions tests/kind/profiles/roberta-1gpu/mcad-coscheduler
@@ -0,0 +1,26 @@
{
"name": "mcad-coscheduler",
"creationTime": 1660657756574,
"lastModifiedTime": 1660747919298,
"lastUsedTime": 1660755725660,
"choices": {
"madwizard/apriori/use-gpu": "don't use gpus",
"madwizard/apriori/arch": "x64",
"madwizard/apriori/platform": "darwin",
"madwizard/apriori/mac-installer": "Homebrew",
"madwizard/apriori/in-terminal": "HTML",
"Start a new Run####Connect Dashboard to an existing Run####Boot up a Cloud Computer####Shut down a Cloud Computer": "Start a new Run",
"Run with CodeFlare Model Architecture####Bring Your Own Code####Demos": "Run with CodeFlare Model Architecture",
"Training Tasks####Fine Tuning Tasks": "Training Tasks",
"Train a Masked Language Model": "Train a Masked Language Model",
"I want to run a quick test with sample data####I have my own custom input data on S3": "I want to run a quick test with sample data",
"AWS####IBM####My data is not stored in S3": "My data is not stored in S3",
"Run Locally####Run on a Kubernetes Cluster": "Run on a Kubernetes Cluster",
"My Cluster is Running Locally####My Cluster is Running on Kubernetes": "My Cluster is Running on Kubernetes",
"expand((kubectl config get-contexts -o name | grep -E . >& /dev/null && kubectl config get-contexts -o name) || (kubectl version | grep Server >& /dev/null && echo \"${KUBE_CONTEXT_FOR_TEST-In-cluster}\" || exit 1), Kubernetes contexts)": "kind-codeflare-test",
"expand([ -z ${KUBE_CONTEXT} ] && exit 1 || X=$([ -n \"$KUBE_NS_FOR_TEST\" ] && echo $KUBE_NS_FOR_TEST || kubectl ${KUBE_CONTEXT_ARG} get ns -o name || oc ${KUBE_CONTEXT_ARG} get projects -o name); echo \"$X\" | sed -E 's#(namespace|project.project.openshift.io)/##' | grep -Ev 'openshift|kube-', Kubernetes namespaces)####Create a namespace": "default",
"Number of CPUs####Number of GPUs####Minimum Workers####Maximum Workers####Worker Memory####Head Memory": "{\"Number of CPUs\":\"1\",\"Number of GPUs\":\"1\",\"Minimum Workers\":\"1\",\"Maximum Workers\":\"1\",\"Worker Memory\":\"8Gi\",\"Head Memory\":\"8Gi\"}",
"Keep It Simple####Use the Ray Autoscaler####Use the Multi-user Enhanced Kubernetes Scheduler": "Use the Multi-user Enhanced Kubernetes Scheduler",
"My administrator has already installed and configured MCAD####MCAD with the Advanced Coscheduler####MCAD with the Default Kubernetes Scheduler": "MCAD with the Advanced Coscheduler"
}
}
26 changes: 26 additions & 0 deletions tests/kind/profiles/roberta-1gpu/mcad-default
@@ -0,0 +1,26 @@
{
"name": "mcad-default",
"creationTime": 1660657756574,
"lastModifiedTime": 1660747919298,
"lastUsedTime": 1660753306596,
"choices": {
"madwizard/apriori/use-gpu": "don't use gpus",
"madwizard/apriori/arch": "x64",
"madwizard/apriori/platform": "darwin",
"madwizard/apriori/mac-installer": "Homebrew",
"madwizard/apriori/in-terminal": "HTML",
"Start a new Run####Connect Dashboard to an existing Run####Boot up a Cloud Computer####Shut down a Cloud Computer": "Start a new Run",
"Run with CodeFlare Model Architecture####Bring Your Own Code####Demos": "Run with CodeFlare Model Architecture",
"Training Tasks####Fine Tuning Tasks": "Training Tasks",
"Train a Masked Language Model": "Train a Masked Language Model",
"I want to run a quick test with sample data####I have my own custom input data on S3": "I want to run a quick test with sample data",
"AWS####IBM####My data is not stored in S3": "My data is not stored in S3",
"Run Locally####Run on a Kubernetes Cluster": "Run on a Kubernetes Cluster",
"My Cluster is Running Locally####My Cluster is Running on Kubernetes": "My Cluster is Running on Kubernetes",
"expand((kubectl config get-contexts -o name | grep -E . >& /dev/null && kubectl config get-contexts -o name) || (kubectl version | grep Server >& /dev/null && echo \"${KUBE_CONTEXT_FOR_TEST-In-cluster}\" || exit 1), Kubernetes contexts)": "kind-codeflare-test",
"expand([ -z ${KUBE_CONTEXT} ] && exit 1 || X=$([ -n \"$KUBE_NS_FOR_TEST\" ] && echo $KUBE_NS_FOR_TEST || kubectl ${KUBE_CONTEXT_ARG} get ns -o name || oc ${KUBE_CONTEXT_ARG} get projects -o name); echo \"$X\" | sed -E 's#(namespace|project.project.openshift.io)/##' | grep -Ev 'openshift|kube-', Kubernetes namespaces)####Create a namespace": "default",
"Number of CPUs####Number of GPUs####Minimum Workers####Maximum Workers####Worker Memory####Head Memory": "{\"Number of CPUs\":\"1\",\"Number of GPUs\":\"1\",\"Minimum Workers\":\"1\",\"Maximum Workers\":\"1\",\"Worker Memory\":\"8Gi\",\"Head Memory\":\"8Gi\"}",
"Keep It Simple####Use the Ray Autoscaler####Use the Multi-user Enhanced Kubernetes Scheduler": "Use the Multi-user Enhanced Kubernetes Scheduler",
"My administrator has already installed and configured MCAD####MCAD with the Advanced Coscheduler####MCAD with the Default Kubernetes Scheduler": "MCAD with the Default Kubernetes Scheduler"
}
}
26 changes: 26 additions & 0 deletions tests/kind/profiles/roberta-1gpu/mcad-preinstalled
@@ -0,0 +1,26 @@
{
"name": "mcad-preinstalled",
"creationTime": 1660657756574,
"lastModifiedTime": 1660747919298,
"lastUsedTime": 1660832127576,
"choices": {
"madwizard/apriori/use-gpu": "don't use gpus",
"madwizard/apriori/arch": "x64",
"madwizard/apriori/platform": "darwin",
"madwizard/apriori/mac-installer": "Homebrew",
"madwizard/apriori/in-terminal": "HTML",
"Start a new Run####Connect Dashboard to an existing Run####Boot up a Cloud Computer####Shut down a Cloud Computer": "Start a new Run",
"Run with CodeFlare Model Architecture####Bring Your Own Code####Demos": "Run with CodeFlare Model Architecture",
"Training Tasks####Fine Tuning Tasks": "Training Tasks",
"Train a Masked Language Model": "Train a Masked Language Model",
"I want to run a quick test with sample data####I have my own custom input data on S3": "I want to run a quick test with sample data",
"AWS####IBM####My data is not stored in S3": "My data is not stored in S3",
"Run Locally####Run on a Kubernetes Cluster": "Run on a Kubernetes Cluster",
"My Cluster is Running Locally####My Cluster is Running on Kubernetes": "My Cluster is Running on Kubernetes",
"expand((kubectl config get-contexts -o name | grep -E . >& /dev/null && kubectl config get-contexts -o name) || (kubectl version | grep Server >& /dev/null && echo \"${KUBE_CONTEXT_FOR_TEST-In-cluster}\" || exit 1), Kubernetes contexts)": "kind-codeflare-test",
"expand([ -z ${KUBE_CONTEXT} ] && exit 1 || X=$([ -n \"$KUBE_NS_FOR_TEST\" ] && echo $KUBE_NS_FOR_TEST || kubectl ${KUBE_CONTEXT_ARG} get ns -o name || oc ${KUBE_CONTEXT_ARG} get projects -o name); echo \"$X\" | sed -E 's#(namespace|project.project.openshift.io)/##' | grep -Ev 'openshift|kube-', Kubernetes namespaces)####Create a namespace": "default",
"Number of CPUs####Number of GPUs####Minimum Workers####Maximum Workers####Worker Memory####Head Memory": "{\"Number of CPUs\":\"1\",\"Number of GPUs\":\"1\",\"Minimum Workers\":\"1\",\"Maximum Workers\":\"1\",\"Worker Memory\":\"8Gi\",\"Head Memory\":\"8Gi\"}",
"Keep It Simple####Use the Ray Autoscaler####Use the Multi-user Enhanced Kubernetes Scheduler": "Use the Multi-user Enhanced Kubernetes Scheduler",
"My administrator has already installed and configured MCAD####MCAD with the Advanced Coscheduler####MCAD with the Default Kubernetes Scheduler": "My administrator has already installed and configured MCAD"
}
}
25 changes: 25 additions & 0 deletions tests/kind/profiles/roberta-1gpu/ray-autoscaler
@@ -0,0 +1,25 @@
{
"name": "ray-autoscaler",
"creationTime": 1660657756574,
"lastModifiedTime": 1660675440396,
"lastUsedTime": 1660743373674,
"choices": {
"madwizard/apriori/use-gpu": "don't use gpus",
"madwizard/apriori/arch": "x64",
"madwizard/apriori/platform": "darwin",
"madwizard/apriori/mac-installer": "Homebrew",
"madwizard/apriori/in-terminal": "HTML",
"Start a new Run####Connect Dashboard to an existing Run####Boot up a Cloud Computer####Shut down a Cloud Computer": "Start a new Run",
"Run with CodeFlare Model Architecture####Bring Your Own Code####Demos": "Run with CodeFlare Model Architecture",
"Training Tasks####Fine Tuning Tasks": "Training Tasks",
"Train a Masked Language Model": "Train a Masked Language Model",
"I want to run a quick test with sample data####I have my own custom input data on S3": "I want to run a quick test with sample data",
"AWS####IBM####My data is not stored in S3": "My data is not stored in S3",
"Run Locally####Run on a Kubernetes Cluster": "Run on a Kubernetes Cluster",
"My Cluster is Running Locally####My Cluster is Running on Kubernetes": "My Cluster is Running on Kubernetes",
"expand((kubectl config get-contexts -o name | grep -E . >& /dev/null && kubectl config get-contexts -o name) || (kubectl version | grep Server >& /dev/null && echo \"${KUBE_CONTEXT_FOR_TEST-In-cluster}\" || exit 1), Kubernetes contexts)": "kind-codeflare-test",
"expand([ -z ${KUBE_CONTEXT} ] && exit 1 || X=$([ -n \"$KUBE_NS_FOR_TEST\" ] && echo $KUBE_NS_FOR_TEST || kubectl ${KUBE_CONTEXT_ARG} get ns -o name || oc ${KUBE_CONTEXT_ARG} get projects -o name); echo \"$X\" | sed -E 's#(namespace|project.project.openshift.io)/##' | grep -Ev 'openshift|kube-', Kubernetes namespaces)####Create a namespace": "default",
"Number of CPUs####Number of GPUs####Minimum Workers####Maximum Workers####Worker Memory####Head Memory": "{\"Number of CPUs\":\"1\",\"Number of GPUs\":\"1\",\"Minimum Workers\":\"1\",\"Maximum Workers\":\"1\",\"Worker Memory\":\"8Gi\",\"Head Memory\":\"8Gi\"}",
"Keep It Simple####Use the Ray Autoscaler####Use the Multi-user Enhanced Kubernetes Scheduler": "Use the Ray Autoscaler"
}
}