Some multi-GPU unit tests hang when running in a different Docker environment #5963

Closed
hcho3 opened this issue Jul 30, 2020 · 6 comments · Fixed by #5966
@hcho3
Collaborator

hcho3 commented Jul 30, 2020

#5873 (comment)
Log from 4-process setup:

pid = 21916, Device = 0
pid = 21919, Device = 0
pid = 21921, Device = 0
pid = 21924, Device = 0

Even though we should have been using 4 GPUs here, every process reports device 0. See #5963 (comment).

The undefined behavior exists in the following tests:

  • tests/python-gpu/test_gpu_with_dask.py::TestDistributedGPU::test_dask_array
  • tests/distributed/runtests-gpu.sh

The behavior is "undefined" in the sense that using a different Docker container causes the tests to fail, even though they were succeeding previously.

Passing (current CI setup):

tests/ci_build/ci_build.sh gpu_build_centos6 docker --build-arg CUDA_VERSION=10.0 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu

Failing (the test just hangs):

tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu
@trivialfis
Member

How did you obtain the device ID? Dask assigns GPUs by setting CUDA_VISIBLE_DEVICES, so inside XGBoost we are always using the first visible device, which is 0.

@hcho3
Collaborator Author

hcho3 commented Jul 30, 2020

@trivialfis I used cudaGetDevice(). See #5873 (comment).

Dask assigns GPUs by setting CUDA_VISIBLE_DEVICES, so inside XGBoost we are always using the first visible device, which is 0.

Thanks, that's good to know. On the other hand, I have no idea why test_dask_array hangs.
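
For reference, a similar per-worker diagnostic can be collected from Python (a minimal sketch, assuming dask_cuda and cupy are available in the environment; this is not part of the test suite):

# Minimal sketch: each Dask-CUDA worker is pinned to a different physical GPU
# via CUDA_VISIBLE_DEVICES, yet the CUDA runtime inside the worker always
# reports device 0, matching the "pid = ..., Device = 0" log above.
import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def report():
    import cupy  # executed on the worker; assumed to be installed there

    return {
        "pid": os.getpid(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "device": cupy.cuda.runtime.getDevice(),  # always 0 per worker
    }


if __name__ == "__main__":
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        # Run the report on every worker; keys are worker addresses.
        for worker, info in client.run(report).items():
            print(worker, info)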

@hcho3 hcho3 changed the title from "In some multi-GPU unit tests, multiple processes use only GPU 0" to "Some multi-GPU unit tests hang when running in a different Docker environment" on Jul 30, 2020
@trivialfis
Member

trivialfis commented Jul 31, 2020

So, given #5873 (comment), can we close this issue?

@hcho3
Collaborator Author

hcho3 commented Jul 31, 2020

@trivialfis No, this issue is present in the master branch. You can try the commands yourself:

tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu

@hcho3
Collaborator Author

hcho3 commented Jul 31, 2020

I found the error by setting the env var DMLC_WORKER_STOP_PROCESS_ON_ERROR=false:

/home/ubuntu/.local/lib/python3.7/site-packages/xgboost/core.py:1161: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   raise XGBoostError(py_str(_LIB.XGBGetLastError()))
E   xgboost.core.XGBoostError: [09:04:03] ../src/tree/updater_gpu_hist.cu:723: Exception in gpu_hist: NCCL failure :unhandled system error ../src/common/device_helpers.cu(71)
E   
E   Stack trace:
E     [bt] (0) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0xa4924) [0x7f40db349924]
E     [bt] (1) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x45a5a3) [0x7f40db6ff5a3]
E     [bt] (2) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x177dd2) [0x7f40db41cdd2]
E     [bt] (3) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x179e65) [0x7f40db41ee65]
E     [bt] (4) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1a352b) [0x7f40db44852b]
E     [bt] (5) /home/ubuntu/.local/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x69) [0x7f40db33dc39]
E     [bt] (6) /opt/python/envs/gpu_test/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f43633f5630]
E     [bt] (7) /opt/python/envs/gpu_test/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f43633f4fed]
E     [bt] (8) /opt/python/envs/gpu_test/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2e7) [0x7f436340b6d7]

/home/ubuntu/.local/lib/python3.7/site-packages/xgboost/core.py:188: XGBoostError

Commands:

tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
CI_DOCKER_EXTRA_PARAMS_INIT='-e DMLC_WORKER_STOP_PROCESS_ON_ERROR=false' \
  tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2  \
  tests/ci_build/test_python.sh mgpu

@hcho3
Collaborator Author

hcho3 commented Jul 31, 2020

I turned on extra diagnostics from NCCL, per the suggestion in pytorch/pytorch#20313.

Command:

CI_DOCKER_EXTRA_PARAMS_INIT='-e NCCL_DEBUG=INFO -e DMLC_WORKER_STOP_PROCESS_ON_ERROR=false' \
   tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
   tests/ci_build/test_python.sh mgpu

Error log:

7635c25c81cd:41590:41835 [0] NCCL INFO Channel 00 : 3[1c0] -> 2[1b0] via direct shared memory
7635c25c81cd:41588:41837 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
7635c25c81cd:41588:41837 [0] NCCL INFO include/shm.h:41 -> 2
7635c25c81cd:41588:41837 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6c6b02bad5bfc129-0-1-0 (size 9637888)
7635c25c81cd:41585:41836 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
7635c25c81cd:41585:41836 [0] NCCL INFO include/shm.h:41 -> 2
7635c25c81cd:41585:41836 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bade85d1d321b26-0-3-2 (size 9637888)

The posix_fallocate() call is failing because there isn't enough space in the shared-memory filesystem /dev/shm. See NVIDIA/nccl#290.

By default, Docker allocates only 64 MB for /dev/shm. Increasing it to 2 GB makes the test pass. Use this command:

CI_DOCKER_EXTRA_PARAMS_INIT='--shm-size=2g'  tests/ci_build/ci_build.sh gpu nvidia-docker -it \
  --build-arg CUDA_VERSION=10.2   tests/ci_build/test_python.sh mgpu
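
For a quick sanity check before running the multi-GPU tests, the shared-memory budget inside the container can be inspected like this (a minimal sketch, not part of the test suite):

# Minimal sketch: report how much space is available on /dev/shm.
# Each failed NCCL segment in the log above is ~9.6 MB, so Docker's 64 MB
# default is exhausted quickly with multiple workers; 2 GB leaves ample headroom.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 2**20:.0f} MiB, free {free / 2**20:.0f} MiB")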
