ci: GH200 test instability #1600

Merged
merged 5 commits into from
Aug 2, 2024

@edopao edopao commented Aug 2, 2024

The GPU CI tests hang at random and eventually time out. To make the CI more stable on GH200 nodes, this PR proposes two changes:

  • Reduce pytest parallelism by lowering the number of processes from 32 to 16 (note that the GPU tests rely on CUDA MPS)
  • Reduce the SLURM timeout so that hanging jobs are detected early (the longest job takes 10 min)
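
To illustrate the two changes above outside of the CI configuration, here is a minimal Python sketch. It is not the code from this PR: the helper names `pytest_command` and `run_with_timeout` are hypothetical, and it assumes the parallelism is controlled through the `-n` option of pytest-xdist. The actual limit in CI is enforced by SLURM, not by Python.

```python
import subprocess
import sys


def pytest_command(num_workers):
    """Build a pytest invocation with a fixed number of worker processes.

    Assumption: the pytest-xdist plugin provides the "-n" option.
    """
    return [sys.executable, "-m", "pytest", "-n", str(num_workers)]


def run_with_timeout(cmd, timeout_s):
    """Run cmd with a wall-clock limit.

    Returns ("ok" | "failed" | "timeout", returncode or None), so a hung
    process is reported as soon as the limit expires.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return ("timeout", None)
    return ("ok" if proc.returncode == 0 else "failed", proc.returncode)
```

A call such as `run_with_timeout(pytest_command(16), timeout_s=15 * 60)` would mirror the idea of a limit set slightly above the longest expected job; the 15-minute value is only an example.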

Additional change:

  • Improve the Dockerfile hash used for caching the base image: it now also includes the build arguments
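
The idea behind this change can be sketched as follows. This is an assumption-laden illustration, not the CI implementation: the function name `base_image_tag`, the use of SHA-256, and the 16-character truncation are all hypothetical. The point is only that the cache key changes when either the Dockerfile or any build argument changes.

```python
import hashlib


def base_image_tag(dockerfile_text: str, build_args: dict[str, str]) -> str:
    """Derive a cache key from the Dockerfile content and its build arguments."""
    digest = hashlib.sha256()
    digest.update(dockerfile_text.encode("utf-8"))
    # Sort the keys so the result does not depend on argument order.
    for key in sorted(build_args):
        digest.update(b"\0")  # separator between fields
        digest.update(f"{key}={build_args[key]}".encode("utf-8"))
    return digest.hexdigest()[:16]
```

With a key like this, two builds of the same Dockerfile with different build arguments (for example a different CUDA version) no longer share a cached base image.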

edopao commented Aug 2, 2024

I need to check with @finkandreas if CUDA MPS on GH200 is configured with multi-GPU support, because that would be desirable in our configuration: https://docs.nvidia.com/deploy/mps/#mps-on-multi-gpu-systems

@edopao edopao marked this pull request as ready for review August 2, 2024 10:53
edopao commented Aug 2, 2024

> I need to check with @finkandreas if CUDA MPS on GH200 is configured with multi-GPU support, because that would be desirable in our configuration: https://docs.nvidia.com/deploy/mps/#mps-on-multi-gpu-systems

@finkandreas Please ignore this question; it is not relevant to us. From the docs:

> When CUDA_VISIBLE_DEVICES is set before launching the control daemon, the devices will be remapped by the MPS server.

I was hoping that MPS could automatically dispatch the CUDA kernels across different devices, but this is not the case.
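
Since MPS does not spread work across devices by itself, any distribution has to be done explicitly by the test processes. A minimal sketch of one way to do that, which is not part of this PR: the helper `device_for_worker` is hypothetical, and it assumes pytest-xdist, which sets the `PYTEST_XDIST_WORKER` environment variable to values such as `gw0`, `gw1`, and so on in each worker.

```python
import re


def device_for_worker(worker_id: str, num_devices: int) -> int:
    """Map a pytest-xdist worker id such as "gw5" to a device index, round-robin.

    Ids that do not match the "gw<N>" pattern (for example when xdist is not
    in use) fall back to device 0.
    """
    match = re.fullmatch(r"gw(\d+)", worker_id)
    index = int(match.group(1)) if match else 0
    return index % num_devices
```

A worker could then read `os.environ.get("PYTEST_XDIST_WORKER", "")`, compute its device, and set `CUDA_VISIBLE_DEVICES` before CUDA is initialized. How a client-side `CUDA_VISIBLE_DEVICES` interacts with MPS depends on how the control daemon was started, so this would need to be checked on the target system.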

@edopao edopao requested a review from havogt August 2, 2024 11:04
Co-authored-by: Hannes Vogt
@edopao edopao requested a review from havogt August 2, 2024 12:16
@havogt havogt left a comment

Thanks!

@edopao edopao merged commit bd4c48e into GridTools:main Aug 2, 2024
31 checks passed
@edopao edopao deleted the ci-gh200 branch August 2, 2024 12:35