NCCL WARN Call to posix_fallocate failed : No space left on device #290
The error happens when NCCL tries to create a shared memory segment, which is just a file in /dev/shm. So it could be that inside your container, /dev/shm is not properly configured, causing that call to fail.
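The failure mode described above can be reproduced outside NCCL. This is a minimal sketch; the file name and 1M size are illustrative, not what NCCL actually uses:

```shell
# /dev/shm is a tmpfs, and NCCL's shared-memory segments are ordinary
# files in it. Allocating a file larger than the remaining tmpfs space
# fails with "No space left on device", the same errno NCCL reports.
df -h /dev/shm
fallocate -l 1M /dev/shm/shm-alloc-test && echo "allocation ok"
rm -f /dev/shm/shm-alloc-test
```

If the `fallocate` fails here, NCCL's posix_fallocate call will fail the same way.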
Closing old issue. Feel free to re-open if needed.
Good day @sjeaugey I was just wondering what you mean by "/dev/shm" not being properly configured within a container? Do you have any tips on configuring it properly? Thank you very much.
That meant having /dev/shm from the host mapped into the containers at the same place, so that two containers can see each other's files.
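As a sketch of that mapping (the image name is taken from the reproduction steps below; the other flags are illustrative, not a recommended configuration):

```shell
# Bind-mount the host's /dev/shm into the container at the same path,
# so shm segments created by one container are visible to others.
docker run -it -d --rm --runtime nvidia \
  -v /dev/shm:/dev/shm \
  nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh
```

Note this shares one host-sized tmpfs across all containers, unlike Docker's default per-container 64 MB /dev/shm.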
Thank you very much! This has helped me resolve my issue. :)
Hi @sjeaugey, I mounted /dev/shm and that solved this issue.
Most probably, the older version of NCCL used P2P instead of shared memory to communicate through the CPU, which was probably slower. If not, you can set NCCL_P2P_LEVEL=NODE to revert to the old behavior.
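For reference, NCCL_P2P_LEVEL is an NCCL environment variable; with mpirun it also needs to be forwarded to the ranks with `-x`:

```shell
# Ask NCCL to use P2P between all GPUs within the node rather than
# shared memory, per the suggestion above.
export NCCL_P2P_LEVEL=NODE
# Then add "-x NCCL_P2P_LEVEL" to the mpirun command line.
```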
Setting the shm-size higher solved the issue for us when going from 2 GPUs to 3 GPUs.
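An alternative to sharing the host's /dev/shm is enlarging the container's own. A sketch (the 1g value is illustrative; size it to the number of ranks):

```shell
# Raise the container's /dev/shm from Docker's 64 MB default so NCCL's
# shm segments fit when running with more GPUs.
docker run -it --rm --runtime nvidia --shm-size=1g \
  nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh
```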
Whether or not /dev/shm is properly mounted, application crashes can leave temporary files behind that accumulate there. If the allocated space is exhausted, this error appears. Wiping the folder or restarting resolves that incident.
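A quick way to check for that situation (the `nccl-*` name pattern is an assumption about how NCCL names its segment files):

```shell
# Inspect /dev/shm usage and look for leftover NCCL segment files
# from crashed runs.
df -h /dev/shm
ls -l /dev/shm/nccl-* 2>/dev/null || echo "no leftover NCCL files"
```

If stale files show up and no job is running, removing them frees the space.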
Hi, I encountered similar errors. So how much
Trying to run a simple Horovod example for distributed GPU training (on a single localhost).
Running inside the NVIDIA TF container (without any extra software installations).
Steps to reproduce:
docker run -it -d --rm --net=host --mount type=bind,source=/home,destination=/home --runtime nvidia nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh
docker exec -it [container_tag] sh
cd horovod/examples
mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py
(from this tutorial). Getting this error.
Works fine for 1 or 2 GPUs. Fails for everything 3+.
df -h output from container:
There are 250 GB of RAM and 8 GPUs on this machine. Tried on two different machines, with NVIDIA 430 and 440 drivers.
Update: everything works fine with the downgraded container nvcr.io/nvidia/tensorflow:19.10-py3.