Some multi-GPU unit tests hang when running in a different Docker environment #5963

Closed
hcho3 opened this issue Jul 30, 2020 · 6 comments · Fixed by #5966
@hcho3
Collaborator

hcho3 commented Jul 30, 2020

#5873 (comment)
Log from 4-process setup:

pid = 21916, Device = 0
pid = 21919, Device = 0
pid = 21921, Device = 0
pid = 21924, Device = 0

Even though we should have been using 4 GPUs here, every process reports device 0. See #5963 (comment).

The undefined behavior exists in the following tests:

  • tests/python-gpu/test_gpu_with_dask.py::TestDistributedGPU::test_dask_array
  • tests/distributed/runtests-gpu.sh

The behavior is "undefined" in the sense that using a different Docker container causes the tests to fail, even though they were succeeding previously.

Passing (current CI setup):

tests/ci_build/ci_build.sh gpu_build_centos6 docker --build-arg CUDA_VERSION=10.0 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu

Failing (the test just hangs):

tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu
@trivialfis
Member

How did you obtain the device ID? Dask assigns GPUs by setting CUDA_VISIBLE_DEVICES, so inside XGBoost we are always using the first visible device, which is 0.

@hcho3
Collaborator Author

hcho3 commented Jul 30, 2020

@trivialfis I used cudaGetDevice(). See #5873 (comment).

Dask assigns GPUs by setting CUDA_VISIBLE_DEVICES, so inside XGBoost we are always using the first visible device, which is 0.

Thanks, that's good to know. On the other hand, I have no idea why test_dask_array hangs.
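
For reference, a similar per-worker diagnostic can be collected from Python (a minimal sketch, assuming dask_cuda and cupy are available in the environment; this is not part of the test suite):

# Minimal sketch: each Dask-CUDA worker is pinned to a different physical GPU
# via CUDA_VISIBLE_DEVICES, yet the CUDA runtime inside the worker always
# reports device 0, matching the "pid = ..., Device = 0" log above.
import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def report():
    import cupy  # executed on the worker; assumed to be installed there

    return {
        "pid": os.getpid(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "device": cupy.cuda.runtime.getDevice(),  # always 0 per worker
    }


if __name__ == "__main__":
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        # Run the report on every worker; keys are worker addresses.
        for worker, info in client.run(report).items():
            print(worker, info)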

@hcho3 hcho3 changed the title from "In some multi-GPU unit tests, multiple processes use only GPU 0" to "Some multi-GPU unit tests hang when running in a different Docker environment" on Jul 30, 2020
@trivialfis
Member

trivialfis commented Jul 31, 2020

So, given #5873 (comment), can we close this issue?

@hcho3
Collaborator Author

hcho3 commented Jul 31, 2020

@trivialfis No, this issue is present in the master branch. You can try the commands yourself:

tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu

@hcho3
Collaborator Author

hcho3 commented Jul 31, 2020

I found the error by setting the env var DMLC_WORKER_STOP_PROCESS_ON_ERROR=false:

/home/ubuntu/.local/lib/python3.7/site-packages/xgboost/core.py:1161: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   raise XGBoostError(py_str(_LIB.XGBGetLastError()))
E   xgboost.core.XGBoostError: [09:04:03] ../src/tree/updater_gpu_hist.cu:723: Exception in gpu_hist: NCCL failure :unhandled system error ../src/common/device_helpers.cu(71)
E   
E   Stack trace:
E     [bt] (0) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0xa4924) [0x7f40db349924]
E     [bt] (1) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x45a5a3) [0x7f40db6ff5a3]
E     [bt] (2) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x177dd2) [0x7f40db41cdd2]
E     [bt] (3) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x179e65) [0x7f40db41ee65]
E     [bt] (4) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1a352b) [0x7f40db44852b]
E     [bt] (5) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x69) [0x7f40db33dc39]
E     [bt] (6) /opt/python/envs/gpu_test/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f43633f5630]
E     [bt] (7) /opt/python/envs/gpu_test/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f43633f4fed]
E     [bt] (8) /opt/python/envs/gpu_test/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2e7) [0x7f436340b6d7]

/home/ubuntu/.local/lib/python3.7/site-packages/xgboost/core.py:188: XGBoostError

Commands:

tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
CI_DOCKER_EXTRA_PARAMS_INIT='-e DMLC_WORKER_STOP_PROCESS_ON_ERROR=false' \
  tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2  \
  tests/ci_build/test_python.sh mgpu

@hcho3
Collaborator Author

hcho3 commented Jul 31, 2020

I turned on extra diagnostics from NCCL, per the suggestion in pytorch/pytorch#20313.

Command:

CI_DOCKER_EXTRA_PARAMS_INIT='-e NCCL_DEBUG=INFO -e DMLC_WORKER_STOP_PROCESS_ON_ERROR=false' \
   tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
   tests/ci_build/test_python.sh mgpu

Error log:

7635c25c81cd:41590:41835 [0] NCCL INFO Channel 00 : 3[1c0] -> 2[1b0] via direct shared memory
7635c25c81cd:41588:41837 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
7635c25c81cd:41588:41837 [0] NCCL INFO include/shm.h:41 -> 2
7635c25c81cd:41588:41837 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6c6b02bad5bfc129-0-1-0 (size 9637888)
7635c25c81cd:41585:41836 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
7635c25c81cd:41585:41836 [0] NCCL INFO include/shm.h:41 -> 2
7635c25c81cd:41585:41836 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bade85d1d321b26-0-3-2 (size 9637888)

The posix_fallocate() call is failing because there isn't enough space in the shared-memory filesystem /dev/shm. See NVIDIA/nccl#290.

By default, Docker allocates only 64 MB for /dev/shm. Increasing it to 2 GB makes the test pass. Use this command:

CI_DOCKER_EXTRA_PARAMS_INIT='--shm-size=2g'  tests/ci_build/ci_build.sh gpu nvidia-docker -it \
  --build-arg CUDA_VERSION=10.2   tests/ci_build/test_python.sh mgpu
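
For a quick sanity check before running the multi-GPU tests, the shared-memory budget inside the container can be inspected like this (a minimal sketch, not part of the test suite):

# Minimal sketch: report how much space is available on /dev/shm.
# Each failed NCCL segment in the log above is ~9.6 MB, so Docker's 64 MB
# default is exhausted quickly with multiple workers; 2 GB leaves ample headroom.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 2**20:.0f} MiB, free {free / 2**20:.0f} MiB")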
