NCCL WARN Call to posix_fallocate failed : No space left on device #290

Closed
apatsekin opened this issue Feb 5, 2020 · 10 comments
@apatsekin

apatsekin commented Feb 5, 2020

Trying to run a simple Horovod example for distributed GPU training (on a single localhost).
Running inside the NVIDIA TF container (without any extra software installed).

Steps to reproduce:

  1. docker run -it -d --rm --net=host --mount type=bind,source=/home,destination=/home --runtime nvidia nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh
  2. Enter the container: docker exec -it [container_tag] sh
  3. Clone this repo: https://github.com/horovod/horovod/tree/v0.18.2
  4. cd horovod/examples
  5. mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py (from this tutorial)

Getting this error:

gpu2-ru7:2144:2556 [2] NCCL INFO Ring 00 : 2[3d000] -> 3[3e000] via direct shared memory
gpu2-ru7:2143:2561 [1] NCCL INFO Ring 00 : 1[1b000] -> 2[3d000] via direct shared memory
gpu2-ru7:2145:2555 [3] NCCL INFO Ring 00 : 3[3e000] -> 0[1a000] via direct shared memory
gpu2-ru7:2142:2554 [0] NCCL INFO Ring 00 : 0[1a000] -> 1[1b000] via direct shared memory
gpu2-ru7:2145:2555 [3] NCCL INFO Ring 00 : 3[3e000] -> 2[3d000] via direct shared memory

gpu2-ru7:2143:2561 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
gpu2-ru7:2143:2561 [1] NCCL INFO include/shm.h:41 -> 2

gpu2-ru7:2142:2554 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device

gpu2-ru7:2143:2561 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cafdd889340823a4-0-2-1 (size 9637888)

gpu2-ru7:2143:2561 [1] NCCL INFO transport/shm.cc:99 -> 2
gpu2-ru7:2142:2554 [0] NCCL INFO include/shm.h:41 -> 2

gpu2-ru7:2142:2554 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-142514a12e41a3-0-1-0 (size 9637888)

Works fine for 1 or 2 GPUs. Fails for anything 3+.
df -h output from the container:

Filesystem      Size  Used Avail Use% Mounted on
overlay         3.5T  2.8T  546G  84% /

The machine has 250 GB of RAM and 8 GPUs. Tried on two different machines with NVIDIA 430 and 440 drivers.

Update: everything works fine with the older container nvcr.io/nvidia/tensorflow:19.10-py3.

@sjeaugey
Member

sjeaugey commented Feb 5, 2020

The error happens when NCCL tries to create a shared memory segment, which is just a file in /dev/shm. So it could be that inside your container, /dev/shm is not properly configured, causing that call to fail.
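A quick way to check is from inside the container (a sketch; the 64 MB figure is Docker's default /dev/shm size, and the ~9.6 MB segment size comes from the log above):

# Check how much space is mounted on /dev/shm inside the container.
# Docker defaults to 64 MB unless --shm-size or --ipc=host is used.
df -h /dev/shm

# Each NCCL segment in the log above is ~9.6 MB (size 9637888), so several
# ranks can exhaust the default allocation quickly.
ls -lh /dev/shm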

@sjeaugey
Member

Closing old issue. Feel free to re-open if needed.

@russchua

Good day @sjeaugey, I was just wondering what you mean by "/dev/shm" not being properly configured within a container. Do you have any tips on configuring it properly?

Thank you very much.

@sjeaugey
Member

That meant having /dev/shm from the host mapped into the containers at the same place, so that two containers could see each other's files.
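For example, a sketch based on the docker run command from the first post (--ipc=host is an alternative that shares the host IPC namespace, which includes /dev/shm):

# Bind-mount the host's /dev/shm so containers share the same segments:
docker run -it -d --rm --net=host -v /dev/shm:/dev/shm --runtime nvidia nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh

# Or share the host IPC namespace instead:
docker run -it -d --rm --net=host --ipc=host --runtime nvidia nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh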

@russchua

Thank you very much! This has helped me resolve my issue. :)

@zpcalan

zpcalan commented Nov 7, 2020

Hi @sjeaugey, I mounted /dev/shm and that solved this issue.
But I wonder why the older nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 image does not have this problem. I didn't have to mount /dev/shm there.

@sjeaugey
Member

sjeaugey commented Nov 9, 2020

Most probably, the older version of NCCL used P2P instead of shared memory to communicate through the CPU, which was probably slower. If it wasn't, you can set NCCL_P2P_LEVEL=NODE to revert to the old behavior.
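For reference, a sketch of passing that variable with the mpirun command from the original report (only -x NCCL_P2P_LEVEL=NODE is added; the other flags are unchanged):

# Export NCCL_P2P_LEVEL=NODE to all ranks so NCCL uses P2P within the node
# instead of /dev/shm-backed shared memory:
mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_P2P_LEVEL=NODE -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py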

@javdl

javdl commented Jul 27, 2022

Setting the shm-size higher solved the issue for us when going from 2 GPUs to 3 GPUs.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be insufficient for NVIDIA Caffe. NVIDIA recommends the use of the following flags: docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
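Applied to the docker run command from the first post, that would look roughly like this (the 1g value follows the note above; adjust to your workload):

# Same container as the original report, with a larger /dev/shm and the
# ulimits recommended in the startup note:
docker run -it -d --rm --net=host --runtime nvidia \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    --mount type=bind,source=/home,destination=/home \
    nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh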

@xloem

xloem commented Jun 7, 2023

When an application crashes, /dev/shm can accumulate temporary files, whether or not it is properly mounted. If the allocated space is exhausted, this error appears. Wiping the folder or restarting resolves that incident.
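A sketch of that cleanup, assuming the stale segments follow the nccl-* naming seen in the log above and no NCCL job is currently running:

# See what is consuming space in /dev/shm:
df -h /dev/shm
ls -lh /dev/shm

# Remove leftover NCCL segments from crashed runs (only when nothing is using them):
rm -f /dev/shm/nccl-*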

@treya-lin

Hi, I encountered similar errors. So how much --shm-size would be appropriate? Will it consume as much as 1g? I don't want it to run out while I am training in the container. Thanks!
