WIP: [ci] run container-canary tests on built images #670

Draft
wants to merge 41 commits into
base: branch-24.10
Commits
e4ffd81
[ci] run container-canary tests on built images
jameslamb May 21, 2024
4dac77e
test running container-canary in CI
jameslamb May 23, 2024
cfc9910
add CI script
jameslamb May 23, 2024
3ce11df
fix workflow file
jameslamb May 23, 2024
9b569ba
try docker-in-docker
jameslamb May 24, 2024
d05d3f3
workflow syntax
jameslamb May 24, 2024
6cb00b6
Merge branch 'branch-24.08' into container-canary
jameslamb May 24, 2024
be12ccd
skip 'rapids' package because 24.08 releases arent available yet
jameslamb May 24, 2024
fd342f7
use sh
jameslamb May 24, 2024
8d51b50
no mamba, update to container-canary v0.3.1
jameslamb May 24, 2024
010ca7a
use go install
jameslamb May 24, 2024
2f073c4
fix version
jameslamb May 24, 2024
68c0e41
skip nvidia-smi
jameslamb May 24, 2024
9ad965a
non-0 fetch depth, make 'bash' explicit
jameslamb May 24, 2024
034ad3a
bash does not exist in the dind image
jameslamb May 24, 2024
9a651b2
executable is called container-canary when you 'go install' it
jameslamb May 24, 2024
4cb2150
fix tags
jameslamb May 24, 2024
f03ee6e
trailing comma
jameslamb May 24, 2024
018b225
try adding registry
jameslamb May 28, 2024
1422b3b
override command
jameslamb May 28, 2024
f2dfdea
get better logs
jameslamb May 28, 2024
d7dc56e
just add one check, increase timeout
jameslamb May 28, 2024
cc4aada
try an even simpler image
jameslamb May 28, 2024
2c51893
try to get a pass
jameslamb May 28, 2024
258bfaf
fix
jameslamb May 28, 2024
7a1f3dc
merge in branch-24.10
jameslamb Aug 1, 2024
c9a18e6
smaller matrix, try startup timeout
jameslamb Aug 1, 2024
2a97385
avoid some dependencies (for testing)
jameslamb Aug 1, 2024
ebc06ac
skip some CI
jameslamb Aug 1, 2024
a898a40
test with longer startup timeout
jameslamb Aug 2, 2024
502d24e
Go cares about case in GitHub URLs
jameslamb Aug 2, 2024
f8addbc
stop using docker-in-docker
jameslamb Aug 2, 2024
bce89eb
Go is already installed on ubuntu-latest
jameslamb Aug 2, 2024
ef7cd0c
set up Go
jameslamb Aug 2, 2024
e34137b
maybe /tmp is writeable
jameslamb Aug 5, 2024
23ee5ef
fix install
jameslamb Aug 5, 2024
abec4fe
actually run validation
jameslamb Aug 5, 2024
98a1ed0
use the real RAPIDS images
jameslamb Aug 5, 2024
eee5fa3
split up checks
jameslamb Aug 6, 2024
802f7a6
pull in latest changes
jameslamb Aug 14, 2024
c968434
Merge branch 'branch-24.10' into container-canary
jameslamb Aug 27, 2024
86 changes: 43 additions & 43 deletions .github/workflows/build-image.yml
@@ -120,46 +120,46 @@ jobs:
PYTHON_VER=${{ inputs.PYTHON_VER }}
RAPIDS_VER=${{ inputs.RAPIDS_VER }}
tags: ${{ inputs.NOTEBOOKS_TAG }}-${{ matrix.ARCH }}
- name: Build RAFT ANN Benchmarks GPU image
uses: docker/build-push-action@v6
with:
context: context
file: raft-ann-bench/gpu/Dockerfile
target: raft-ann-bench
push: true
pull: true
build-args: |
CUDA_VER=${{ inputs.CUDA_VER }}
LINUX_VER=${{ inputs.LINUX_VER }}
PYTHON_VER=${{ inputs.PYTHON_VER }}
RAPIDS_VER=${{ inputs.RAPIDS_VER }}
tags: ${{ inputs.RAFT_ANN_BENCH_TAG }}-${{ matrix.ARCH }}
- name: Build RAFT ANN Benchmarks GPU with datasets image
uses: docker/build-push-action@v6
with:
context: context
file: raft-ann-bench/gpu/Dockerfile
target: raft-ann-bench-datasets
push: true
pull: true
build-args: |
CUDA_VER=${{ inputs.CUDA_VER }}
LINUX_VER=${{ inputs.LINUX_VER }}
PYTHON_VER=${{ inputs.PYTHON_VER }}
RAPIDS_VER=${{ inputs.RAPIDS_VER }}
tags: ${{ inputs.RAFT_ANN_BENCH_DATASETS_TAG }}-${{ matrix.ARCH }}
- name: Build RAFT ANN Benchmarks CPU image
if: inputs.BUILD_RAFT_ANN_BENCH_CPU_IMAGE
uses: docker/build-push-action@v6
with:
context: context
file: raft-ann-bench/cpu/Dockerfile
target: raft-ann-bench-cpu
push: true
pull: true
build-args: |
CUDA_VER=${{ inputs.CUDA_VER }}
LINUX_VER=${{ inputs.LINUX_VER }}
PYTHON_VER=${{ inputs.PYTHON_VER }}
RAPIDS_VER=${{ inputs.RAPIDS_VER }}
tags: ${{ inputs.RAFT_ANN_BENCH_CPU_TAG }}-${{ matrix.ARCH }}
# - name: Build RAFT ANN Benchmarks GPU image
# uses: docker/build-push-action@v6
# with:
# context: context
# file: raft-ann-bench/gpu/Dockerfile
# target: raft-ann-bench
# push: true
# pull: true
# build-args: |
# CUDA_VER=${{ inputs.CUDA_VER }}
# LINUX_VER=${{ inputs.LINUX_VER }}
# PYTHON_VER=${{ inputs.PYTHON_VER }}
# RAPIDS_VER=${{ inputs.RAPIDS_VER }}
# tags: ${{ inputs.RAFT_ANN_BENCH_TAG }}-${{ matrix.ARCH }}
# - name: Build RAFT ANN Benchmarks GPU with datasets image
# uses: docker/build-push-action@v6
# with:
# context: context
# file: raft-ann-bench/gpu/Dockerfile
# target: raft-ann-bench-datasets
# push: true
# pull: true
# build-args: |
# CUDA_VER=${{ inputs.CUDA_VER }}
# LINUX_VER=${{ inputs.LINUX_VER }}
# PYTHON_VER=${{ inputs.PYTHON_VER }}
# RAPIDS_VER=${{ inputs.RAPIDS_VER }}
# tags: ${{ inputs.RAFT_ANN_BENCH_DATASETS_TAG }}-${{ matrix.ARCH }}
# - name: Build RAFT ANN Benchmarks CPU image
# if: inputs.BUILD_RAFT_ANN_BENCH_CPU_IMAGE
# uses: docker/build-push-action@v6
# with:
# context: context
# file: raft-ann-bench/cpu/Dockerfile
# target: raft-ann-bench-cpu
# push: true
# pull: true
# build-args: |
# CUDA_VER=${{ inputs.CUDA_VER }}
# LINUX_VER=${{ inputs.LINUX_VER }}
# PYTHON_VER=${{ inputs.PYTHON_VER }}
# RAPIDS_VER=${{ inputs.RAPIDS_VER }}
# tags: ${{ inputs.RAFT_ANN_BENCH_CPU_TAG }}-${{ matrix.ARCH }}
31 changes: 31 additions & 0 deletions .github/workflows/build-test-publish-images.yml
@@ -225,6 +225,37 @@ jobs:
          GPUCIBOT_DOCKERHUB_TOKEN: ${{ secrets.GPUCIBOT_DOCKERHUB_TOKEN }}
          ARCHES: ${{ toJSON(matrix.ARCHES) }}
        run: ci/create-multiarch-manifest.sh
  validate:
    # 'compute-matrix' must be listed here for 'needs.compute-matrix.outputs' to resolve
    needs: [compute-matrix, build, build-multiarch-manifest]
    strategy:
      matrix: ${{ fromJSON(needs.compute-matrix.outputs.TEST_MATRIX) }}
      fail-fast: false
    secrets: inherit
    uses: ./.github/workflows/validate.yml
    with:
      ARCH: ${{ matrix.ARCH }}
      CONTAINER_CANARY_VERSION: main
      CUDA_VER: ${{ matrix.CUDA_VER }}
      GPU: ${{ matrix.GPU }}
      DRIVER: ${{ matrix.DRIVER }}
      PYTHON_VER: ${{ matrix.PYTHON_VER }}
      # images to test
      BASE_TAG:
        "docker.io/rapidsai/${{ needs.compute-matrix.outputs.BASE_IMAGE_REPO }}:\
        ${{ needs.compute-matrix.outputs.BASE_TAG_PREFIX }}\
        ${{ needs.compute-matrix.outputs.RAPIDS_VER }}\
        ${{ needs.compute-matrix.outputs.ALPHA_TAG }}-\
        cuda${{ matrix.CUDA_VER }}-\
        py${{ matrix.PYTHON_VER }}-\
        ${{ matrix.ARCH }}"
      NOTEBOOKS_TAG:
        "docker.io/rapidsai/${{ needs.compute-matrix.outputs.NOTEBOOKS_IMAGE_REPO }}:\
        ${{ needs.compute-matrix.outputs.NOTEBOOKS_TAG_PREFIX }}\
        ${{ needs.compute-matrix.outputs.RAPIDS_VER }}\
        ${{ needs.compute-matrix.outputs.ALPHA_TAG }}-\
        cuda${{ matrix.CUDA_VER }}-\
        py${{ matrix.PYTHON_VER }}-\
        ${{ matrix.ARCH }}"
  test:
    needs: [compute-matrix, build]
    if: inputs.run_tests
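The `BASE_TAG` and `NOTEBOOKS_TAG` inputs in the `validate` job are built by concatenating an image repo, a tag prefix, the RAPIDS version, an alpha suffix, and the CUDA, Python, and architecture values. A minimal shell sketch of the same concatenation follows. `compose_image_uri` is a hypothetical helper, and the example values are assumptions, since the real ones come from `compute-matrix` outputs that this diff does not show.

```shell
# Hypothetical helper mirroring how the workflow concatenates the tag parts.
# Arguments: repo, tag prefix, RAPIDS version, alpha suffix, CUDA, Python, arch.
compose_image_uri() {
    printf 'docker.io/rapidsai/%s:%s%s%s-cuda%s-py%s-%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Example values are assumptions, not taken from the real matrix.
compose_image_uri "base" "" "24.10" "a" "12.5" "3.11" "amd64"
```

Printing the composed URI like this before calling `docker pull` can make a malformed tag easier to spot in CI logs.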
93 changes: 93 additions & 0 deletions .github/workflows/validate.yml
@@ -0,0 +1,93 @@
name: Validate images

on:
  workflow_call:
    inputs:
      ARCH:
        required: true
        type: string
      CONTAINER_CANARY_VERSION:
        description: 'git ref of https://github.com/NVIDIA/container-canary to install (a release tag, or a branch such as main)'
        required: true
        type: string
      CUDA_VER:
        required: true
        type: string
      DRIVER:
        required: true
        type: string
      GPU:
        required: true
        type: string
      PYTHON_VER:
        required: true
        type: string
      BASE_TAG:
        required: true
        type: string
      NOTEBOOKS_TAG:
        required: true
        type: string

defaults:
  run:
    shell: sh

permissions:
  actions: read
  checks: none
  contents: read
  deployments: none
  discussions: none
  id-token: write
  issues: none
  packages: read
  pages: none
  pull-requests: read
  repository-projects: none
  security-events: none
  statuses: none

jobs:
  validate:
    strategy:
      matrix:
        ARCH: ["${{ inputs.ARCH }}"]
        CUDA_VER: ["${{ inputs.CUDA_VER }}"]
        PYTHON_VER: ["${{ inputs.PYTHON_VER }}"]
        GPU: ["${{ inputs.GPU }}"]
        DRIVER: ["${{ inputs.DRIVER }}"]
      fail-fast: false
    runs-on: "linux-${{ inputs.ARCH }}-cpu4"
    # container:
    #   image: 'docker:dind'
    #   options: --privileged
    #   env:
    #     NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 1
      - name: Install Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22.x'
      - name: Install container-canary
        run: |
          GOBIN=/tmp/canary-bin go install github.com/nvidia/container-canary@${{ inputs.CONTAINER_CANARY_VERSION }}
          # 'go install' names the binary 'container-canary', but run-checks.sh calls 'canary'
          ln -sf /tmp/canary-bin/container-canary /tmp/canary-bin/canary
          /tmp/canary-bin/container-canary version
      - name: (base) container-canary checks
        run: |
          export PATH="/tmp/canary-bin:${PATH}"
          # run-checks.sh uses bash-only features (arrays, 'function', '(( ))')
          bash ./ci/container-canary/run-checks.sh \
            --dask-scheduler \
            ${{ inputs.BASE_TAG }}
      - name: (notebooks) container-canary checks
        run: |
          export PATH="/tmp/canary-bin:${PATH}"
          bash ./ci/container-canary/run-checks.sh \
            --dask-scheduler \
            --notebooks \
            ${{ inputs.NOTEBOOKS_TAG }}
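Both check steps above pass option flags first and the image URI last, so a script consuming them has to read the image from the final positional argument rather than from `$1`. A minimal POSIX shell sketch of parsing arguments in that shape follows. `last_arg` and `has_flag` are hypothetical helpers written for illustration, not functions from `run-checks.sh`.

```shell
# Hypothetical helper: print the last positional argument (the image URI).
last_arg() {
    # after the loop, _arg holds the final argument
    for _arg in "$@"; do :; done
    printf '%s\n' "${_arg}"
}

# Hypothetical helper: succeed if flag $1 appears among the remaining arguments.
has_flag() {
    _flag="$1"
    shift
    for _arg in "$@"; do
        if [ "${_arg}" = "${_flag}" ]; then
            return 0
        fi
    done
    return 1
}

# Example invocation shaped like the workflow steps (the image URI is illustrative).
echo "image: $(last_arg --dask-scheduler --notebooks 'rapidsai/notebooks:24.10a-cuda12.5-py3.11')"
if has_flag --notebooks --dask-scheduler --notebooks 'rapidsai/notebooks:24.10a-cuda12.5-py3.11'; then
    echo "notebooks checks enabled"
fi
```

Comparing whole arguments, instead of grepping a space-joined string, avoids false matches when one flag is a substring of another.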
2 changes: 1 addition & 1 deletion Dockerfile
@@ -52,7 +52,7 @@ COPY condarc /opt/conda/.condarc

RUN <<EOF
mamba install -y -n base \
"rapids=${RAPIDS_VER}.*" \
"cugraph=${RAPIDS_VER}.*" \
"python=${PYTHON_VER}.*" \
"cuda-version=${CUDA_VER%.*}.*" \
ipython
24 changes: 24 additions & 0 deletions ci/container-canary/README.md
@@ -0,0 +1,24 @@
# container-canary

Configurations for testing images built from this repo with `container-canary` ([NVIDIA/container-canary](https://github.com/NVIDIA/container-canary)).

## Running the tests

Install `container-canary` following the instructions in that project's repo.

Run the tests against a built image.
For example:

```shell
IMAGE_URI="rapidsai/notebooks:24.06a-cuda11.8-py3.11"

# using a config checked in here
canary validate \
    --file ./ci/container-canary/base.yml \
    "${IMAGE_URI}"

# using a config from the container-canary repo
canary validate \
    --file https://raw.githubusercontent.com/NVIDIA/container-canary/main/examples/databricks.yaml \
    "${IMAGE_URI}"
```
72 changes: 72 additions & 0 deletions ci/container-canary/base.yml
@@ -0,0 +1,72 @@
apiVersion: container-canary.nvidia.com/v1
kind: Validator
name: rapids-base
description: |
  Tests characteristics that the general-purpose RAPIDS images are expected to have.
documentation: https://github.com/rapidsai/docker
# This command just ensures the container stays up long enough for
# all checks to complete.
command:
  - /bin/sh
  - -c
  - "sleep 600"
checks:
  - name: tool-conda
    description: conda can be executed
    probe:
      exec:
        command:
          - conda
          - --version
  - name: tool-dask-cli
    description: Dask CLI can be executed
    probe:
      exec:
        command:
          - python
          - -m
          - dask
          - --version
      timeoutSeconds: 10
  # ref: https://github.com/rapidsai/docker/issues/668
  - name: tool-distributed-spec-cli
    description: Distributed dask_spec CLI can be executed
    probe:
      exec:
        command:
          - python
          - -m
          - distributed.cli.dask_spec
          - --version
  - name: user-is-rapids
    description: Default user is rapids (uid=1001)
    probe:
      exec:
        command:
          - /bin/sh
          - -c
          - 'test "$(id)" = "uid=1001(rapids) gid=1000(conda) groups=1000(conda)"'
  - name: home-directory
    description: $HOME is "/home/rapids"
    probe:
      exec:
        command:
          - /bin/sh
          - -c
          - 'test "$HOME" = "/home/rapids"'
  - name: working-directory
    description: Working directory is /home/rapids
    probe:
      exec:
        command:
          - /bin/sh
          - -c
          - 'test "$(pwd)" = "/home/rapids"'
  - name: conda-only-base-env
    description: The only defined conda env is "base"
    probe:
      exec:
        command:
          - /bin/bash
          - -c
          - "[[ $(conda env list --quiet | grep --count -E '^[A-Za-z]+') == 1 ]];"
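The `conda-only-base-env` check counts the lines of `conda env list` output that begin with a letter and requires exactly one. A minimal sketch of that counting logic, run against canned text so it needs no conda install, is below. `count_named_envs` is a hypothetical helper, and the sample output is an assumption about what `conda env list` prints in these images.

```shell
# Hypothetical helper: count stdin lines that begin with a letter.
# Header lines in the listing start with '#', so only env names are counted.
count_named_envs() {
    grep -c -E '^[A-Za-z]+'
}

# Assumed shape of `conda env list` output when only "base" exists.
sample='# conda environments:
#
base                  *  /opt/conda'

printf '%s\n' "$sample" | count_named_envs
```

Note that `grep -c` exits non-zero when the count is 0, so under `set -e` a listing with no named environments would abort the script before the comparison runs.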
16 changes: 16 additions & 0 deletions ci/container-canary/notebooks.yml
@@ -0,0 +1,16 @@
apiVersion: container-canary.nvidia.com/v1
kind: Validator
name: rapids-notebooks
description: |
  Tests characteristics that any RAPIDS images shipping Jupyter
  are expected to have.
documentation: https://github.com/rapidsai/docker
checks:
  - name: tool-jupyter-lab
    description: jupyter lab can be executed
    probe:
      exec:
        command:
          - jupyter
          - lab
          - --version
42 changes: 42 additions & 0 deletions ci/container-canary/run-checks.sh
@@ -0,0 +1,42 @@
#!/bin/bash

set -e -E -u -o pipefail

# The image URI is the last positional argument; flags such as
# '--notebooks' and '--dask-scheduler' come before it.
IMAGE_URI="${@: -1}"

NUMARGS=$#
ARGS=$*

# succeeds if "$1" appears as a whole word among the script's arguments
function hasArg {
    (( NUMARGS != 0 )) && (echo " ${ARGS} " | grep -q " $1 ")
}

# pre-pull
docker pull "${IMAGE_URI}"

check_configs=(
    ./ci/container-canary/base.yml
)

if hasArg '--notebooks'; then
    check_configs+=(./ci/container-canary/notebooks.yml)
fi

if hasArg '--dask-scheduler'; then
    check_configs+=(https://raw.githubusercontent.com/NVIDIA/container-canary/main/examples/dask-scheduler.yaml)
fi

for check_config in "${check_configs[@]}"; do
    echo "checking '${IMAGE_URI}' with '${check_config}'"
    canary validate \
        --file "${check_config}" \
        --startup-timeout 60 \
        "${IMAGE_URI}"
done

echo "done checking '${IMAGE_URI}' with container-canary"

# using a config from the container-canary repo
canary validate \
    --file https://raw.githubusercontent.com/NVIDIA/container-canary/main/examples/dask-scheduler.yaml \
    "rapidsai/notebooks:24.10a-cuda12.2-py3.11"
4 changes: 2 additions & 2 deletions matrix.yaml
@@ -1,6 +1,6 @@
CUDA_VER: # Should be `<major>.<minor>.<patch>` (e.g. `11.2.2`)
- "11.8.0"
- "12.0.1"
# - "11.8.0"
# - "12.0.1"
- "12.2.2"
- "12.5.1"
PYTHON_VER: