end-to-end gpu driver testing enhancement #88

shivakunv · 2024-08-16T10:22:51Z

TODO--

1

matrix:
  driver:
    - 535.183.06
    - 550.90.07

An idea for a potential follow-up: instead of defining a matrix and spinning up one AWS instance per driver version, can we pass all driver versions as input to the test script and test them in sequence? For example, first install 535.183.06, then upgrade to 550.90.07. This would also let us source the driver versions from the versions.mk file instead of redefining and maintaining that list here.

2
+name: CI
Let's rename this to "End-to-end tests".

3
+# Install the operator with usePrecompiled mode set to true
Remove this comment as it is not accurate.

4

AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}
I don't believe AWS_SESSION_TOKEN is needed with the current holodeck implementation. Let's remove it.

5
We can simply wait for the nvidia-driver pod to be ready:
kubectl wait -n ${TEST_NAMESPACE} --for=condition=Ready pod -l app=nvidia-driver-daemonset --timeout 10m
If successful, then wait for the validator pod to be ready (this means that the rest of the pods are healthy):
kubectl wait -n ${TEST_NAMESPACE} --for=condition=Ready pod -l app=nvidia-operator-validator --timeout 2m
If either of these commands fails, capture the state of all pods in the operator namespace by running kubectl get pods -n ${TEST_NAMESPACE}, and also capture some logs so we can debug.
This will reduce the amount of logs emitted during the test. Right now, we print out all pods every 5 seconds, which makes the output very hard to read.
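
For illustration, the wait-then-capture flow suggested in this comment could look roughly like the sketch below. It is not the code that was merged: the function names wait_for_pods and verify_operator are hypothetical, TEST_NAMESPACE is assumed to be set by the caller, and capturing the last 100 log lines of the matching pods is an assumed choice for "some logs".

```shell
#!/usr/bin/env bash
# Sketch only; names below are hypothetical and not taken from the repository.
# Assumes TEST_NAMESPACE is exported by the caller.

# Wait for pods matching a label to become Ready; on failure, dump
# the namespace state and recent logs once instead of polling.
wait_for_pods() {
  local label="$1" timeout="$2"
  if ! kubectl wait -n "${TEST_NAMESPACE}" --for=condition=Ready pod \
      -l "${label}" --timeout "${timeout}"; then
    echo "Pods with label ${label} were not Ready within ${timeout}" >&2
    # Capture the state of all pods in the operator namespace.
    kubectl get pods -n "${TEST_NAMESPACE}" >&2 || true
    # Assumed choice: last 100 log lines from every container of the matching pods.
    kubectl logs -n "${TEST_NAMESPACE}" -l "${label}" \
      --all-containers --tail=100 >&2 || true
    return 1
  fi
}

# Driver pod first, then the validator pod (which implies the rest are healthy).
verify_operator() {
  wait_for_pods "app=nvidia-driver-daemonset" 10m &&
    wait_for_pods "app=nvidia-operator-validator" 2m
}
```

A test script would call verify_operator and exit non-zero if it fails, so the pod listing is printed once at the point of failure rather than every 5 seconds.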


shivakunv · 2024-08-17T06:32:41Z

@cdesiniotis PTAL


Signed-off-by: shiva kumar <[email protected]>

shivakunv self-assigned this Aug 16, 2024

shivakunv force-pushed the enhancegpuvalidation branch from bf4b25a to b29b09c on August 16, 2024 10:24

shivakunv requested a review from cdesiniotis August 16, 2024 10:24

shivakunv commented Aug 16, 2024

.github/workflows/ci.yaml (outdated, resolved)

shivakunv commented Aug 16, 2024

.github/workflows/ci.yaml (outdated, resolved)

shivakunv commented Aug 16, 2024

.github/workflows/image.yaml (outdated, resolved)

shivakunv commented Aug 16, 2024

tests/scripts/verify-operator.sh (outdated, resolved)

shivakunv force-pushed the enhancegpuvalidation branch from b29b09c to b042889 on August 16, 2024 10:31

shivakunv marked this pull request as ready for review August 16, 2024 10:32

shivakunv force-pushed the enhancegpuvalidation branch 16 times, most recently from ae9de5c to e6d824a on August 16, 2024 20:16

cdesiniotis reviewed Aug 16, 2024

tests/scripts/checks.sh (outdated, resolved)

tests/scripts/remote.sh (resolved)

tests/scripts/verify-operator.sh (outdated, resolved)

tests/scripts/verify-operator.sh (outdated, resolved)

shivakunv force-pushed the enhancegpuvalidation branch 4 times, most recently from 07a660f to 84f80c9 on August 17, 2024 05:29

shivakunv force-pushed the enhancegpuvalidation branch from 84f80c9 to 13839ac on August 17, 2024 06:11

cdesiniotis reviewed Aug 19, 2024

.github/workflows/ci.yaml (outdated, resolved)

tests/scripts/must-gather.sh (outdated, resolved)

tests/scripts/uninstall-operator.sh (outdated, resolved)

shivakunv force-pushed the enhancegpuvalidation branch 9 times, most recently from d97cc36 to c26329e on August 20, 2024 09:21

end-to-end gpu driver testing enhancement

c6f8865

Signed-off-by: shiva kumar <[email protected]>

shivakunv force-pushed the enhancegpuvalidation branch from c26329e to c6f8865 on August 20, 2024 09:40

cdesiniotis approved these changes Aug 20, 2024


cdesiniotis merged commit f8c3a2b into NVIDIA:main Aug 20, 2024
6 checks passed

shivakunv deleted the enhancegpuvalidation branch September 4, 2024 17:32
