
RuntimeError: NCCL error in #574

Open
leidriver201120 opened this issue Jun 12, 2022 · 1 comment
@leidriver201120

When I tried to train on a Linux server with two GPUs, I ran into an error. The output of the two worker processes is interleaved; each of them raises:

return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:600, unhandled cuda error

The error comes from dist_model = DistModule(model) in the main function of train.py. The command was:

CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node=2 --master_port=2333 tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml

I searched online; some people said switching to a single GPU solved the problem for them, but doesn't that give up the whole advantage of multi-GPU parallel training? I haven't found a solution yet. If anyone knows how to fix this, I would be very grateful for an answer.
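
Before giving up on multi-GPU training, it may help to isolate whether NCCL itself can run a collective across the two visible GPUs, since "unhandled cuda error" usually points at the CUDA/driver layer rather than at the training code. Below is a minimal sketch of such a check (the file name nccl_check.py is my own invention; everything else uses standard torch.distributed calls). NCCL_DEBUG=INFO is a standard NCCL environment variable that makes the failing step print a more specific reason.

```python
# nccl_check.py -- hypothetical standalone sanity check, NOT part of this repo.
# Launch it the same way as training, e.g.:
#   CUDA_VISIBLE_DEVICES=1,2 NCCL_DEBUG=INFO python -m torch.distributed.launch \
#       --nproc_per_node=2 --master_port=2333 nccl_check.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker process
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)   # pin this process to one visible GPU
dist.init_process_group(backend='nccl')  # MASTER_ADDR/PORT etc. come from the launcher's env

# Broadcast is the same collective the failing torch._C._dist_broadcast call performs.
t = torch.full((4,), float(args.local_rank), device='cuda')
dist.broadcast(t, src=0)
print('rank %d sees %s' % (dist.get_rank(), t.tolist()))  # all ranks should show rank 0's values
```

If this small script fails with the same NCCL error, the problem is in the CUDA/NCCL installation rather than in tools/train.py.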

@Sourabh9468

Installing the cudatoolkit=10.2 package may remove this error.
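
A quick way to see whether a toolkit mismatch is plausible is to print what the installed PyTorch was built against. This is a sketch using standard introspection attributes (torch.cuda.nccl.version() may not exist on very old builds such as the 2018 one in the traceback above):

```python
import torch

print(torch.__version__)          # the traceback above points at a build from 2018
print(torch.version.cuda)         # CUDA toolkit version PyTorch was compiled against
print(torch.cuda.is_available())  # False here would mean a driver/runtime problem
print(torch.cuda.device_count())  # should be 2 when CUDA_VISIBLE_DEVICES=1,2
print(torch.cuda.nccl.version())  # bundled NCCL version, if this helper exists in your build
```

If torch.version.cuda disagrees with the CUDA version your driver supports (as reported by nvidia-smi), reinstalling PyTorch together with a matching cudatoolkit, as suggested above, is the usual fix.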
