Added Nvidia GPU support to the buildah-remote task #1529

Open: 1 commit to be merged into base: main

@syedriko syedriko commented Oct 23, 2024

Added Nvidia GPU support to the buildah-remote task

@brianwcook
Contributor

brianwcook commented Oct 24, 2024

I left you a couple of review comments. The buildah-remote tasks are generated by running /hack/generate-buildah-remote.sh in this repo, which keeps them consistent with the regular buildah task. You need to modify the main.go that generate-buildah-remote.sh calls so that running the script produces the same diff this PR has. Once you run it, the PR should have 3 changed files: the generate script, buildah-remote/0.1/buildah-remote.yaml and buildah-remote/0.2/buildah-remote.yaml.

After that, you also need to run /hack/generate-ta-tasks.sh, which will update 2 more files (the trusted artifacts versions of the 2 buildah-remote tasks).

Summary: you will modify 1 file (main.go), run two generate commands and add all those changes here.
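
The two generate steps above can be sketched as a small helper. This is only an illustration: the function name `regenerate_tasks` is hypothetical, and it assumes both scripts run without arguments from the repository root, which the comment does not state.

```bash
# Hypothetical helper: run the two generator scripts named in the review,
# in order, from the root of a checkout.
# Assumption: neither script needs arguments; check each script before relying on this.
regenerate_tasks() {
  local repo_root="$1"
  (
    # The subshell keeps the caller's working directory unchanged.
    cd "$repo_root" &&
      ./hack/generate-buildah-remote.sh &&
      ./hack/generate-ta-tasks.sh
  )
}

# Example (path is illustrative):
#   regenerate_tasks ~/src/build-definitions
#   git -C ~/src/build-definitions status --short
```

Running `git status --short` afterwards is a quick way to confirm that only the expected files changed before adding them to the PR.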

@syedriko syedriko changed the title Added Nvidia GPU support to the buildah-remote 2.0 task Added Nvidia GPU support to the buildah-remote task Oct 24, 2024
@syedriko
Author

Thanks, @brianwcook! PTAL

@brianwcook
Contributor

/ok-to-test

@@ -445,6 +445,11 @@ spec:
REMOTESSHEOF
chmod +x scripts/script-build.sh

PODMAN_NVIDIA_ARGS=()
if [[ "$PLATFORM" == "linux-g"* ]]; then
Member

I have tried in the past to not depend on the semantics of the PLATFORM parameter. @ifireball @mshaposhnik, what do you think?

The use of the PLATFORM parameter like this would fall in line with the functionality requested in https://issues.redhat.com/browse/KONFLUX-4073.

@chmeliik
Contributor

Could you describe what these changes do, how and why? The code change on its own doesn't give me much to work with.

@@ -445,6 +445,11 @@ spec:
REMOTESSHEOF
chmod +x scripts/script-build.sh

PODMAN_NVIDIA_ARGS=()
if [[ "$PLATFORM" == "linux-g"* ]]; then
PODMAN_NVIDIA_ARGS+=("--device nvidia.com/gpu=all" "--security-opt=label=disable")
Contributor

What are the implications of --security-opt=label=disable?

@chmeliik
Contributor

I would also consider dropping this PR in favor of #1530, although that one seems like it potentially gives the user too much control.

@syedriko
Author

syedriko commented Oct 25, 2024

> Could you describe what these changes do, how and why? The code change on its own doesn't give me much to work with

The goal here is to allow Konflux builds to access Nvidia GPUs on machines so equipped. One example is running PyTorch during a container build: https://github.com/openshift/lightspeed-rag-content/blob/main/Containerfile.

This PR is one building block toward supporting this scenario. The others are the AWS instance type(s) in the multi-platform controller (https://github.com/redhat-appstudio/infra-deployments/blob/0b936310854c7b4031b967eda33ad8399f12da60/components/multi-platform-controller/production/common/host-config.yaml#L528) and an AMI with Nvidia drivers.

For platforms whose name starts with "linux-g", this PR tells podman to pass Nvidia GPU devices through to the containers it runs.
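
The prefix check described here can be tried in isolation with a sketch like the one below. The function name `nvidia_args_for_platform` and the sample platform strings are hypothetical. Unlike the diff, this version stores `--device` and its value as separate array elements, which is one way to keep them as distinct arguments under a quoted `"${args[@]}"` expansion; it is not necessarily how the task consumes the array.

```bash
# Hypothetical helper: print one podman argument per line for a platform string.
nvidia_args_for_platform() {
  local platform="$1"
  local -a args=()
  # The unquoted * on the right-hand side of == makes this a glob (prefix) match.
  if [[ "$platform" == "linux-g"* ]]; then
    args+=(--device nvidia.com/gpu=all --security-opt=label=disable)
  fi
  # Print nothing at all for non-GPU platforms.
  if (( ${#args[@]} > 0 )); then
    printf '%s\n' "${args[@]}"
  fi
}

# Illustrative calls (platform names are made up for the example):
#   nvidia_args_for_platform "linux-gexample/amd64"   # prints the three arguments
#   nvidia_args_for_platform "linux/arm64"            # prints nothing
```

Note that the match is purely textual, so any platform name beginning with "linux-g" would receive the GPU arguments whether or not the host has a GPU.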

> What are the implications of --security-opt=label=disable?

I'm not too sure about it beyond the obvious, but it came from the Nvidia docs: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.14.2/cdi-support.html

Update: I tried dropping --security-opt=label=disable, and the build container could no longer access the GPU device.
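
For context, `--security-opt=label=disable` turns off SELinux label separation for the container. A minimal sketch for comparing the two variants on a GPU host follows; it only builds the command strings and does not run podman. The function name `gpu_check_cmd` is hypothetical, and the `ubuntu` image with `nvidia-smi -L` is an illustrative smoke test, not something this PR uses.

```bash
# Hypothetical helper: build (but do not run) a podman command that lists GPUs
# from inside a container. Pass "yes" to include --security-opt=label=disable.
gpu_check_cmd() {
  local with_label_disable="$1"
  local -a cmd=(podman run --rm --device nvidia.com/gpu=all)
  if [[ "$with_label_disable" == "yes" ]]; then
    cmd+=(--security-opt=label=disable)
  fi
  # Assumption: the image provides or is injected with nvidia-smi via CDI.
  cmd+=(ubuntu nvidia-smi -L)
  # Join the array with spaces into a single printable line.
  printf '%s\n' "${cmd[*]}"
}

# Illustrative use on a host with a GPU and a generated CDI spec:
#   eval "$(gpu_check_cmd yes)"
#   eval "$(gpu_check_cmd no)"
```

Running both variants on the same host would show whether the failure seen without the option reproduces outside the build task.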
