cgroup issue with nvidia container runtime on Debian testing #1447
Comments
Hi, I'm experiencing the same issue. For now I've worked around it. Here are the relevant lines from my configuration:
This is equivalent to Hope this helps. |
This seems to be related to systemd defaulting to the unified cgroup hierarchy (cgroup v2). Indeed, the default setup no longer exposes the legacy v1 controllers (notably the devices cgroup) that the runtime expects. Using the documented kernel parameter to switch back to the legacy hierarchy works around it. |
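A quick way to check which hierarchy a system is actually running (these two commands are my addition for triage, not part of the comment above):

```bash
# Prints "cgroup2fs" on the unified (v2) hierarchy and "tmpfs" on the legacy (v1) layout.
stat -fc %T /sys/fs/cgroup/

# On a legacy/hybrid setup, the v1 devices controller the runtime relies on shows up here.
ls -d /sys/fs/cgroup/devices 2>/dev/null || echo "no v1 devices cgroup (unified hierarchy)"
```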
@lissyx Thank you for pointing out the crux of the issue. That said, this rearchitecting effort will take at least another 9 months to complete. I'm curious what the impact is in the meantime (and how difficult it would be to add interim cgroup v2 support to the existing stack). |
Wanted to chime in to say that I'm also experiencing this on Fedora 33. |
Could the title be updated to indicate that it is systemd cgroup layout related? |
I was under the impression this issue was related to adding cgroup v2 support. The systemd cgroup layout issue was resolved in: And released today as part of libnvidia-container v1.3.2: If these resolve this issue, please comment here and close it. Thanks. |
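To pick up that release, a minimal upgrade sketch (using the package names that appear elsewhere in this thread; exact names can differ by distro):

```bash
# Upgrade the NVIDIA container stack to the fixed release and restart Docker.
sudo apt-get update
sudo apt-get install --only-upgrade \
  libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit nvidia-docker2
sudo systemctl restart docker
```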
Issue resolved by the latest release. Thank you everyone <3 |
Did you set the kernel parameter discussed in the comments above, or did you just upgrade all the packages? |
For me it was solved by upgrading the package. |
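If you are unsure which version you ended up with after the upgrade, these checks (my addition, using commands that also appear later in this thread) will tell you:

```bash
# Report the installed libnvidia-container CLI version.
nvidia-container-cli -V
# List the installed NVIDIA container packages and their versions (Debian/Ubuntu).
dpkg -l 'libnvidia-container*' 'nvidia-container*' 'nvidia-docker*'
```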
Thank you, @super-cooper, for the reply. I am having exactly the same issue on Debian Testing even after an upgrade.

1. Issue or feature description
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
@klueska Could you please check the issue? |
@regzon thanks for indicating that this is still an issue. Could you please check what your |
@regzon your issue is likely related to the fact that Debian Testing now defaults to the unified cgroup hierarchy (cgroup v2). You will need to follow the suggestion in the comments above for #1447 (comment) to force systemd to use v1 cgroups. In any case -- we do not officially support Debian Testing or cgroups v2 (yet). |
@elezar @klueska thank you for your help. When forcing the v1 cgroup hierarchy, everything works as expected. |
@klueska I'm having the same "issue", i.e. missing support for cgroups v2 (which I would very much like for other reasons). |
We are not planning on building support for cgroups v2 into the existing nvidia-docker stack. Please see my comment above for more info: |
Let me rephrase it then: I want to use nvidia-docker on a system where cgroup v2 is enabled (i.e. the unified cgroup hierarchy). |
We have it tracked in our internal JIRA, with a link to this issue as the place to report back once the work is complete: |
Facebook's oomd requires cgroup v2, i.e. systemd.unified_cgroup_hierarchy=1. So users either freeze their boxes fairly often and render them unusable, or they cannot use NVIDIA containers. Neither option is acceptable. We will probably drop the nvidia-docker nonsense. |
For Debian users: you can disable the unified cgroup hierarchy by editing the kernel command line and then regenerating the boot configuration. It's worth noting that I also had to modify /etc/nvidia-container-runtime/config.toml to remove the '@' symbol and update the path to ldconfig for my system (Debian Unstable). This worked for me; I hope this saves someone else some time. |
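A hedged sketch of that workflow; the GRUB paths and the exact ldconfig location are my assumptions (they vary between distros), so double-check them on your system:

```bash
# 1) Revert to the legacy (v1) cgroup hierarchy on a GRUB-based Debian system.
#    Edit /etc/default/grub and append the parameter to the default kernel
#    command line, e.g.:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=false"
sudo update-grub
sudo reboot

# 2) In /etc/nvidia-container-runtime/config.toml, point ldconfig at the real
#    binary instead of the host-resolved '@' form, e.g. change:
#      ldconfig = "@/sbin/ldconfig"
#    to:
#      ldconfig = "/sbin/ldconfig"
```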
Fix on Arch: Edit /etc/nvidia-container-runtime/config.toml and change #no-cgroups=false to no-cgroups=true. After a restart of the docker.service everything worked as usual. |
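A one-liner version of that edit, as a sketch (the spacing of the shipped config line may differ, so verify the result before restarting Docker):

```bash
# Enable the no-cgroups option in the NVIDIA container runtime config,
# then restart Docker so the change takes effect.
sudo sed -i 's/^#\?no-cgroups *=.*/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker.service
```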
@Zethson I also use Arch and yesterday I followed your suggestion. It seemed to work (I was able to start the containers), but running |
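If the containers start but GPU access inside them then fails with `no-cgroups = true`, a common cause is that the runtime no longer sets up device-cgroup access, so the NVIDIA device nodes have to be passed in explicitly. A hedged example (the device list is an assumption; match it to the nodes under /dev on your host):

```bash
# Pass the NVIDIA device nodes through manually when device cgroup setup is disabled.
docker run --rm --gpus all \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.0-base nvidia-smi
```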
Arch now has cgroup v2 enabled by default, so it'd be useful to plan for supporting it. |
Awesome this works well. |
Fix on NixOS (where cgroup v2 is also now the default): add `systemd.enableUnifiedCgroupHierarchy = false;` to your configuration and restart. |
This worked for me on Manjaro Linux (Arch Linux as base) without deactivating cgroup v2:
After that you have to add the following content to your
Background: the nvidia-uvm and nvidia-uvm-tools device nodes did not exist under /dev. Edit: this workaround only works if the container using the NVIDIA runtime is restarted afterwards. I do not know why, but if it isn't, the container starts but cannot access the created devices. Update 25.06.2021: |
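For context, the elided script in the comment above typically looks like the following sketch; the node names and major-number lookup are standard for the NVIDIA driver, but treat the exact script as an assumption rather than a quote of the original:

```bash
#!/bin/bash
# Create the /dev/nvidia-uvm device nodes that are sometimes missing at boot.
# Must run as root; requires the nvidia-uvm kernel module to be available.
set -e
modprobe nvidia-uvm
# The major number for nvidia-uvm is assigned dynamically; read it from /proc/devices.
major=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
[ -e /dev/nvidia-uvm ]       || mknod -m 666 /dev/nvidia-uvm c "$major" 0
[ -e /dev/nvidia-uvm-tools ] || mknod -m 666 /dev/nvidia-uvm-tools c "$major" 1
```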
If you need cgroups active and therefore cannot use the `no-cgroups = true` workaround:
I then had to reboot, and the issue was gone. |
Please see NVIDIA/libnvidia-container#111 (comment) for instructions on how to get access to this RC (or wait for the full release at the end of next week). Note: This does not directly add |
This may be useful for Ubuntu users running into this issue:
@klueska, I just wanted to mention that when I go to the following URL: https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list I get a valid apt list in response. But if I visit https://nvidia.github.io/nvidia-docker/ubuntu18.04/libnvidia-container.list I get an error. It appears the list has been moved back to the original filename? |
ah, 🤦 , much appreciated, thanks for making it explicit. |
Release notes here: |
Debian 11 support has now been added, so running the following should work as expected:
|
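The commands themselves did not survive in this thread; a typical sequence under that announcement (my assumption, using package and image names that appear elsewhere in the thread) would be:

```bash
# Install/upgrade the NVIDIA container stack and run the standard smoke test.
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```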
Are any additional steps required? I have libnvidia-container1 in version 1.8.0-1 on PopOS and the error persists:
|
You must not have the package installed correctly, because that error message doesn't exist in that form in 1.8.0. |
Took an entire day to find the linked comment [1] by @biggs, which says:

> Fix on NixOS (where cgroup v2 is also now default): add
> `systemd.enableUnifiedCgroupHierarchy = false;`
> and restart.

Indeed, after applying this commit and then running `sudo systemctl restart docker`, any of the following commands works:

```bash
sudo docker run --gpus=all nvidia/cuda:10.0-runtime nvidia-smi
sudo docker run --runtime=nvidia nvidia/cuda:10.0-runtime nvidia-smi
sudo nvidia-docker run nvidia/cuda:10.0-runtime nvidia-smi
```

ARGH!!!1

Links:
[1] NVIDIA/nvidia-docker#1447 (comment)
[2] NixOS/nixpkgs#127146
[3] NixOS/nixpkgs#73800
[4] https://blog.zentria.company/posts/nixos-cgroupsv2/

P.S. I use Colemak, but typing arstarstarst doesn't have the same ring to it.
Hi all: I encountered a problem, can you help me?

$ docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base-ubuntu20.04' locally

$ nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0722 06:41:38.215548 281198 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)

$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Fri Jul 22 14:42:41 2022
Attached GPUs : 8
GPU 00000000:1C:00.0
GPU 00000000:1F:00.0
GPU 00000000:23:00.0
GPU 00000000:35:00.0
GPU 00000000:36:00.0
GPU 00000000:39:00.0
GPU 00000000:3D:00.0

$ docker version
Server: Docker Engine - Community

$ rpm -qa 'nvidia'
nvidia-docker2-2.11.0-1.noarch

$ nvidia-container-cli -V
cli-version: 1.10.0 |
@sippmilkwolf the issue you show is not related to the original post and can occur when persistence mode is not enabled. Please see https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon. |
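For convenience, the two usual ways to enable persistence mode described in that document (the systemd unit name is an assumption and may differ by distro and driver packaging):

```bash
# Preferred: run the NVIDIA persistence daemon.
sudo systemctl enable --now nvidia-persistenced
# Legacy alternative: enable persistence mode directly on all GPUs.
sudo nvidia-smi -pm 1
```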
I followed the method in the link you provided and solved the problem. Thank you for your help :-) |
I'm trying to set up nvidia-docker2 on PopOS 22.04. I confirmed that the installed libnvidia-container1 is 1.8.0, but I still get the "cgroup not found" message: |
I also have PopOS 22.04 and the same issue, but still no luck. Same error as you, @universebreaker
|
@j2l Exactly the same error for me! |
@universebreaker @seth100 @j2l there was a bug in the code to handle Note that if you upgrade to |
@elezar I already have the latest
|
ok, I found the fix:

sudo vi /etc/apt/preferences.d/pop-default-settings
# Add this at the end of the file:
# ---
Package: *
Pin: origin nvidia.github.io
Pin-Priority: 1002
# ---
sudo apt-get update
sudo apt-get upgrade

Here are the package versions now:

libnvidia-cfg1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-common-515/jammy,jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 all [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-container-tools/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-egl-wayland1/jammy,now 1:1.1.9-1.1 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-extra-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
nvidia-compute-utils-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-container-toolkit-base/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
nvidia-dkms-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed]
nvidia-docker2/bionic,now 2.11.0-1 all [installed]
nvidia-driver-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed]
nvidia-kernel-common-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-kernel-source-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-settings/jammy,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
xserver-xorg-video-nvidia-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]

and now it works. |
The method from @seth100 did not work for me on my installation of Pop 22.04, but using the install instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit did the trick. |
1. Issue or feature description
Whenever I try to build or run an NVIDIA container, Docker fails with the error message:
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V