feat: extend integration tests for experiments #105

Merged: 52 commits, Aug 17, 2023
3935bf7
feat: extend integration tests
NohaIhab Aug 2, 2023
61d6719
fix: remove kubeflow namespace from experiment
NohaIhab Aug 2, 2023
2716162
fix: missing namespace parameter
NohaIhab Aug 2, 2023
1d57f27
fix: assert_exp_status_running parameters
NohaIhab Aug 2, 2023
fd9c9f9
feat: assert trials
NohaIhab Aug 2, 2023
179840a
feat: test delete experiment
NohaIhab Aug 3, 2023
6df870e
fix: re-organize test data files
NohaIhab Aug 3, 2023
bc2381c
Merge branch 'main' into kf-4009-feat-integration-tests
NohaIhab Aug 3, 2023
507bbdb
feat: add examples to cover all images and reduce trial counts
NohaIhab Aug 8, 2023
1a50b2a
fix: refactor tests file structure
NohaIhab Aug 8, 2023
c7dc4c2
fix: increase retry attempts
NohaIhab Aug 8, 2023
3cda835
Merge branch 'kf-4009-feat-integration-tests' of https://github.com/c…
NohaIhab Aug 8, 2023
7d624fb
fix: remove test deploy from test_katib_experiments.py
NohaIhab Aug 8, 2023
2f9e3e3
fix: remove experiments test from test_charms.py
NohaIhab Aug 8, 2023
6765d7b
fix: format and lint
NohaIhab Aug 8, 2023
21baab5
feat: assert experiments succeeded and remove trials check
NohaIhab Aug 9, 2023
4f0197f
fix: typo
NohaIhab Aug 9, 2023
b2f9f82
fix: change container images in examples to v0.15.0
NohaIhab Aug 9, 2023
1925c42
fix: correct db relation in bundle test
NohaIhab Aug 9, 2023
822abd5
feat: check experiment is running or succeeded
NohaIhab Aug 9, 2023
fc5c15e
test: use relation for mariadb
NohaIhab Aug 9, 2023
63f15d8
test: comment exhaustive enas example
NohaIhab Aug 9, 2023
7c48953
[skip] ci: add ssh
NohaIhab Aug 10, 2023
2b53d97
Merge branch 'kf-4009-feat-integration-tests' of https://github.com/c…
NohaIhab Aug 10, 2023
8b6aaf8
[skip] uncomment enas example
NohaIhab Aug 10, 2023
6338c08
[skip] ci: ssh bundle tests
NohaIhab Aug 10, 2023
8c959e4
[skip] ci: ssh bundle tests
NohaIhab Aug 10, 2023
b95f75b
[skip] ci: ssh bundle tests after setup
NohaIhab Aug 10, 2023
82a9af8
fix: add missing separator
NohaIhab Aug 10, 2023
8065eec
attempt to fix CI
NohaIhab Aug 10, 2023
201b7a7
[skip] indent CI step
NohaIhab Aug 10, 2023
068058e
Merge branch 'main' into kf-4009-feat-integration-tests
NohaIhab Aug 10, 2023
9a1283c
feat:set resources limit to examples
NohaIhab Aug 11, 2023
e1cb018
fix: remove unnecessary istio sidecar annotation
NohaIhab Aug 11, 2023
13602d7
fix: hyperband example
NohaIhab Aug 11, 2023
a63d483
[skip] fix: typo
NohaIhab Aug 13, 2023
0025301
feat: add comments on upstream differences
NohaIhab Aug 13, 2023
4e5466d
[skip] fix: remove unused logger
NohaIhab Aug 13, 2023
8ffc3a9
[skip] fix: method name more meaningful
NohaIhab Aug 13, 2023
87116ca
[skip] fix: test fail message
NohaIhab Aug 13, 2023
59c5e72
[skip] fix: remove whitespace
NohaIhab Aug 13, 2023
b7818a3
fix: address comments
NohaIhab Aug 13, 2023
97cf046
Merge branch 'kf-4009-feat-integration-tests' of https://github.com/c…
NohaIhab Aug 13, 2023
605def4
[skip] fix: correct upstream links
NohaIhab Aug 14, 2023
b8dbd5f
feat: CR discovery
NohaIhab Aug 15, 2023
430b003
feat: default to module logger in utils.py
NohaIhab Aug 15, 2023
7689e55
feat: deploy experiments in user namespace with profiles operator
NohaIhab Aug 16, 2023
8bbf2a6
fix: address comments
NohaIhab Aug 16, 2023
1083ff6
fix: Minor docstring/log changes
phoevos Aug 16, 2023
cb05fad
fix: remove wait_for_idle
NohaIhab Aug 17, 2023
f8c6e30
fix: use assert rather than raise AssertionError
NohaIhab Aug 17, 2023
6fe5c23
Merge branch 'main' into kf-4009-feat-integration-tests
phoevos Aug 17, 2023
16 changes: 15 additions & 1 deletion .github/workflows/integrate.yaml
@@ -94,6 +94,20 @@ jobs:
     runs-on: ubuntu-20.04
 
     steps:
+      # Ideally we'd use self-hosted runners, but that effort is not yet stable.
+      # This action removes unused software (dotnet, haskell, android libs, codeql,
+      # and docker images) from the GH runner, freeing around 60 GB of storage:
+      # about 40 GB on root and about 20 GB on a mount point.
+      - name: Maximise GH runner space
+        uses: easimon/maximize-build-space@v7
+        with:
+          root-reserve-mb: 40960
+          remove-dotnet: 'true'
+          remove-haskell: 'true'
+          remove-android: 'true'
+          remove-codeql: 'true'
+          remove-docker-images: 'true'
+
       - name: Check out code
         uses: actions/checkout@v3
       - name: Setup operator environment
@@ -102,7 +116,7 @@ jobs:
           provider: microk8s
           channel: 1.24/stable
           juju-channel: 2.9/stable
-          microk8s-addons: "dns storage rbac metallb:10.64.140.43-10.64.140.49"
+          microk8s-addons: "dns storage rbac"
 
       - name: Run test
         run: |
75 changes: 75 additions & 0 deletions tests/assets/crs/experiments/bayesian-optimization.yaml
@@ -0,0 +1,75 @@
# Source: katib/examples/v1beta1/hp-tuning/bayesian-optimization.yaml
# This example is slightly modified from upstream to consume less resources.
# There's a `modified` comment where we diverge from upstream.
# When updating this file, make sure to keep those modifications.
---
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: bayesian-optimization
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: "random_state"
        value: "10"
  parallelTrialCount: 1 # modified
  maxTrialCount: 1 # modified
  maxFailedTrialCount: 1 # modified
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - sgd
          - adam
          - ftrl
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
      - name: optimizer
        description: Training model optimizer (sdg, adam or ftrl)
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:v0.15.0
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"
                  - "--optimizer=${trialParameters.optimizer}"
                resources: # modified
                  limits: # modified
                    memory: "2Gi" # modified
                    cpu: "1" # modified
            restartPolicy: Never
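Note on the trialTemplate above: each `${trialParameters.<name>}` placeholder in the container command resolves through the matching `trialParameters` entry, whose `reference` names a search-space parameter. The Python sketch below mimics that substitution using the values from this file. It is only an illustration: `render_command` and the sample `assignments` are hypothetical, and this is neither Katib's implementation nor code from this PR.

```python
import re

# Hypothetical helper, not part of the PR: mimics how a trial template's
# ${trialParameters.<name>} placeholders get filled with suggested values.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_]+)\}")


def render_command(command, trial_parameters, assignments):
    """Replace each placeholder with the value assigned to its referenced parameter.

    command: list of argument strings from the trialSpec container.
    trial_parameters: list of {"name": ..., "reference": ...} dicts.
    assignments: mapping from search-space parameter name to a suggested value.
    """
    reference_by_name = {p["name"]: p["reference"] for p in trial_parameters}

    def substitute(match):
        name = match.group(1)
        if name not in reference_by_name:
            raise KeyError(f"no trialParameters entry named {name!r}")
        return str(assignments[reference_by_name[name]])

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


# Values copied from the Experiment above.
trial_parameters = [
    {"name": "learningRate", "reference": "lr"},
    {"name": "numberLayers", "reference": "num-layers"},
    {"name": "optimizer", "reference": "optimizer"},
]
command = [
    "python3",
    "/opt/mxnet-mnist/mnist.py",
    "--batch-size=64",
    "--lr=${trialParameters.learningRate}",
    "--num-layers=${trialParameters.numberLayers}",
    "--optimizer=${trialParameters.optimizer}",
]
# Assumed sample values inside the feasible space; real ones come from the algorithm.
assignments = {"lr": "0.02", "num-layers": "3", "optimizer": "adam"}

rendered = render_command(command, trial_parameters, assignments)
print(rendered)
```

A placeholder without a matching `trialParameters` entry raises `KeyError` here, which is one way a test could catch a mis-wired template before submitting it to a cluster.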
75 changes: 75 additions & 0 deletions tests/assets/crs/experiments/cmaes.yaml
@@ -0,0 +1,75 @@
# Source: katib/examples/v1beta1/hp-tuning/cma-es.yaml
# This example is slightly modified from upstream to consume less resources.
# There's a `modified` comment where we diverge from upstream.
# When updating this file, make sure to keep those modifications.
---
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: cmaes
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: cmaes
    algorithmSettings:
      - name: "restart_strategy"
        value: "ipop"
  parallelTrialCount: 1 # modified
  maxTrialCount: 1 # modified
  maxFailedTrialCount: 1 # modified
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - sgd
          - adam
          - ftrl
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
      - name: optimizer
        description: Training model optimizer (sdg, adam or ftrl)
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:v0.15.0
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"
                  - "--optimizer=${trialParameters.optimizer}"
                resources: # modified
                  limits: # modified
                    memory: "2Gi" # modified
                    cpu: "1" # modified
            restartPolicy: Never
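Note on the trimmed counts above: with `maxTrialCount: 1` the experiment should finish after a single trial. Several commits in this PR mention asserting that an experiment is running or has succeeded, but the test code itself is not part of this excerpt. The sketch below shows one way such a check could look, assuming the Experiment has already been fetched as a plain dict and that its `status.conditions` entries carry `type` and `status` fields in the usual Kubernetes style. The function names and the sample object are hypothetical.

```python
def has_condition(experiment: dict, condition_type: str) -> bool:
    """Return True if the Experiment reports `condition_type` with status "True"."""
    conditions = experiment.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions
    )


def is_running_or_succeeded(experiment: dict) -> bool:
    """Accept both states, since a one-trial experiment may finish before the check."""
    return has_condition(experiment, "Running") or has_condition(experiment, "Succeeded")


# Hand-written sample object (an assumption, not output captured from a cluster).
sample = {
    "metadata": {"name": "cmaes"},
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "False"},
            {"type": "Succeeded", "status": "True"},
        ]
    },
}

# Plain assert, in the spirit of the "use assert rather than raise AssertionError" commit.
assert is_running_or_succeeded(sample), "experiment is neither running nor succeeded"
```

An object with no `status` yet simply yields `False`, so a retrying caller can keep polling instead of failing on a missing key.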
77 changes: 77 additions & 0 deletions tests/assets/crs/experiments/darts-cpu.yaml
@@ -0,0 +1,77 @@
# Source: katib/examples/v1beta1/nas/darts-cpu.yaml
# This example is slightly modified from upstream to consume less resources.
# There's a `modified` comment where we diverge from upstream.
# When updating this file, make sure to keep those modifications.
---
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: darts-cpu
spec:
  parallelTrialCount: 1
  maxTrialCount: 1
  maxFailedTrialCount: 1
  objective:
    type: maximize
    objectiveMetricName: Best-Genotype
  metricsCollectorSpec:
    collector:
      kind: StdOut
    source:
      filter:
        metricsFormat:
          - "([\\w-]+)=(Genotype.*)"
  algorithm:
    algorithmName: darts
    algorithmSettings:
      - name: num_epochs
        value: "1"
      - name: num_nodes
        value: "1"
      - name: init_channels
        value: "1"
      - name: stem_multiplier
        value: "1"
  nasConfig:
    graphConfig:
      numLayers: 1
    operations:
      - operationType: max_pooling
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list:
                - "3"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: algorithmSettings
        description: Algorithm settings of DARTS Experiment
        reference: algorithm-settings
      - name: searchSpace
        description: Search Space of DARTS Experiment
        reference: search-space
      - name: numberLayers
        description: Number of Neural Network layers
        reference: num-layers
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/darts-cnn-cifar10-cpu:v0.15.0
                command:
                  - python3
                  - run_trial.py
                  - --algorithm-settings="${trialParameters.algorithmSettings}"
                  - --search-space="${trialParameters.searchSpace}"
                  - --num-layers="${trialParameters.numberLayers}"
                resources: # modified
                  limits: # modified
                    memory: "2Gi" # modified
                    cpu: "1" # modified
            restartPolicy: Never
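Note on the `metricsFormat` filter above: the pattern captures a metric name followed by `=` and a value that starts with `Genotype`, which is how the `Best-Genotype` objective can be read from the trial's stdout. The Python sketch below applies the same pattern to a few hand-written log lines to show what it does and does not match. The log lines are made up, and Python's `re` module is used only for illustration; the collector's own regex engine may differ in details.

```python
import re

# The YAML double-quoted string "([\\w-]+)=(Genotype.*)" unescapes to this pattern.
METRICS_FORMAT = re.compile(r"([\w-]+)=(Genotype.*)")

# Hand-written sample lines (assumptions); real trial output is not shown in this PR.
log_lines = [
    "Epoch 1 finished",
    "Best-Genotype=Genotype(normal=[[('max_pooling_3x3',0)]],normal_concat=range(2,3))",
    "accuracy=0.42",
]

metrics = {}
for line in log_lines:
    match = METRICS_FORMAT.search(line)
    if match:
        # group(1) is the metric name, group(2) the reported value.
        metrics[match.group(1)] = match.group(2)

print(metrics)
```

Only the second line is captured: `accuracy=0.42` has the `name=value` shape, but its value does not start with `Genotype`, so the filter ignores it.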