NCCL WARN Call to posix_fallocate failed : No space left on device #290

Closed
apatsekin opened this issue Feb 5, 2020 · 10 comments
@apatsekin

apatsekin commented Feb 5, 2020

Trying to run a simple Horovod example for distributed GPU training (on a single localhost).
Running inside the NVIDIA TF container (without any extra software installed).

Steps to reproduce:

  1. docker run -it -d --rm --net=host --mount type=bind,source=/home,destination=/home --runtime nvidia nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh
  2. Enter the container: docker exec -it [container_tag] sh
  3. Clone this repo: https://github.com/horovod/horovod/tree/v0.18.2
  4. cd horovod/examples
  5. mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py (from this tutorial)

Getting this error:

gpu2-ru7:2144:2556 [2] NCCL INFO Ring 00 : 2[3d000] -> 3[3e000] via direct shared memory
gpu2-ru7:2143:2561 [1] NCCL INFO Ring 00 : 1[1b000] -> 2[3d000] via direct shared memory
gpu2-ru7:2145:2555 [3] NCCL INFO Ring 00 : 3[3e000] -> 0[1a000] via direct shared memory
gpu2-ru7:2142:2554 [0] NCCL INFO Ring 00 : 0[1a000] -> 1[1b000] via direct shared memory
gpu2-ru7:2145:2555 [3] NCCL INFO Ring 00 : 3[3e000] -> 2[3d000] via direct shared memory

gpu2-ru7:2143:2561 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
gpu2-ru7:2143:2561 [1] NCCL INFO include/shm.h:41 -> 2

gpu2-ru7:2142:2554 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device

gpu2-ru7:2143:2561 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cafdd889340823a4-0-2-1 (size 9637888)

gpu2-ru7:2143:2561 [1] NCCL INFO transport/shm.cc:99 -> 2
gpu2-ru7:2142:2554 [0] NCCL INFO include/shm.h:41 -> 2

gpu2-ru7:2142:2554 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-142514a12e41a3-0-1-0 (size 9637888)

Works fine for 1 or 2 GPUs. Fails for anything 3+.
df -h output from the container:

Filesystem      Size  Used Avail Use% Mounted on
overlay         3.5T  2.8T  546G  84% /

The machine has 250 GB of RAM and 8 GPUs. Tried on two different machines with NVIDIA 430 and 440 drivers.

Update: everything works fine with the older container nvcr.io/nvidia/tensorflow:19.10-py3.

@sjeaugey
Member

sjeaugey commented Feb 5, 2020

The error happens when NCCL tries to create a shared memory segment, which is just a file in /dev/shm. So it could be that inside your container, /dev/shm is not properly configured, causing that call to fail.
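A quick way to check is from inside the container (a sketch; the 64 MB figure is Docker's default /dev/shm size, and the ~9.6 MB segment size comes from the log above):

# Check how much space is mounted on /dev/shm inside the container.
# Docker defaults to 64 MB unless --shm-size or --ipc=host is used.
df -h /dev/shm

# Each NCCL segment in the log above is ~9.6 MB (size 9637888), so several
# ranks can exhaust the default allocation quickly.
ls -lh /dev/shm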

@sjeaugey
Member

Closing old issue. Feel free to re-open if needed.

@russchua

Good day @sjeaugey, I was just wondering what you mean by "/dev/shm" not being properly configured within a container. Do you have any tips on configuring it properly?

Thank you very much.

@sjeaugey
Member

That meant having /dev/shm from the host mapped into the containers at the same place, so that two containers could see each other's files.
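For example, a sketch based on the docker run command from the first post (--ipc=host is an alternative that shares the host IPC namespace, which includes /dev/shm):

# Bind-mount the host's /dev/shm so containers share the same segments:
docker run -it -d --rm --net=host -v /dev/shm:/dev/shm --runtime nvidia nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh

# Or share the host IPC namespace instead:
docker run -it -d --rm --net=host --ipc=host --runtime nvidia nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh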

@russchua

Thank you very much! This has helped me resolve my issue. :)

@zpcalan

zpcalan commented Nov 7, 2020

Hi @sjeaugey, I mounted /dev/shm and that solved this issue.
But I wonder why the older nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 image does not have this problem. I didn't have to mount /dev/shm there.

@sjeaugey
Member

sjeaugey commented Nov 9, 2020

Most probably, the older version of NCCL used P2P instead of shared memory to communicate through the CPU, which was probably slower. If it wasn't, you can set NCCL_P2P_LEVEL=NODE to revert to the old behavior.
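For reference, a sketch of passing that variable with the mpirun command from the original report (only -x NCCL_P2P_LEVEL=NODE is added; the other flags are unchanged):

# Export NCCL_P2P_LEVEL=NODE to all ranks so NCCL uses P2P within the node
# instead of /dev/shm-backed shared memory:
mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_P2P_LEVEL=NODE -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py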

@javdl

javdl commented Jul 27, 2022

Setting the shm-size higher solved the issue for us when going from 2 GPUs to 3 GPUs.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be insufficient for NVIDIA Caffe. NVIDIA recommends the use of the following flags: docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
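Applied to the docker run command from the first post, that would look roughly like this (the 1g value follows the note above; adjust to your workload):

# Same container as the original report, with a larger /dev/shm and the
# ulimits recommended in the startup note:
docker run -it -d --rm --net=host --runtime nvidia \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    --mount type=bind,source=/home,destination=/home \
    nvcr.io/nvidia/tensorflow:20.01-tf1-py3 sh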

@xloem

xloem commented Jun 7, 2023

When an application crashes, /dev/shm can accumulate temporary files, whether or not it is properly mounted. If the allocated space is exhausted, this error appears. Wiping the folder or restarting resolves that incident.
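A sketch of that cleanup, assuming the stale segments follow the nccl-* naming seen in the log above and no NCCL job is currently running:

# See what is consuming space in /dev/shm:
df -h /dev/shm
ls -lh /dev/shm

# Remove leftover NCCL segments from crashed runs (only when nothing is using them):
rm -f /dev/shm/nccl-*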

@treya-lin

Hi, I encountered similar errors. So how much --shm-size would be appropriate? Will it consume as much as 1g? I don't want it to run out while I am training in the container. Thanks!
