
RuntimeError: NCCL error in #574

Open
leidriver201120 opened this issue Jun 12, 2022 · 1 comment
@leidriver201120

When I tried to train on a Linux server with two GPUs, I ran into an error. The output of the two worker processes is interleaved; each of them raises:

return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:600, unhandled cuda error

The error comes from dist_model = DistModule(model) in the main function of train.py. The command was:

CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node=2 --master_port=2333 tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml

I searched online; some people said switching to a single GPU solved the problem for them, but doesn't that give up the whole advantage of multi-GPU parallel training? I haven't found a solution yet. If anyone knows how to fix this, I would be very grateful for an answer.
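
Before giving up on multi-GPU training, it may help to isolate whether NCCL itself can run a collective across the two visible GPUs, since "unhandled cuda error" usually points at the CUDA/driver layer rather than at the training code. Below is a minimal sketch of such a check (the file name nccl_check.py is my own invention; everything else uses standard torch.distributed calls). NCCL_DEBUG=INFO is a standard NCCL environment variable that makes the failing step print a more specific reason.

```python
# nccl_check.py -- hypothetical standalone sanity check, NOT part of this repo.
# Launch it the same way as training, e.g.:
#   CUDA_VISIBLE_DEVICES=1,2 NCCL_DEBUG=INFO python -m torch.distributed.launch \
#       --nproc_per_node=2 --master_port=2333 nccl_check.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker process
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)   # pin this process to one visible GPU
dist.init_process_group(backend='nccl')  # MASTER_ADDR/PORT etc. come from the launcher's env

# Broadcast is the same collective the failing torch._C._dist_broadcast call performs.
t = torch.full((4,), float(args.local_rank), device='cuda')
dist.broadcast(t, src=0)
print('rank %d sees %s' % (dist.get_rank(), t.tolist()))  # all ranks should show rank 0's values
```

If this small script fails with the same NCCL error, the problem is in the CUDA/NCCL installation rather than in tools/train.py.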

@Sourabh9468

Installing the cudatoolkit=10.2 package may remove this error.
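
A quick way to see whether a toolkit mismatch is plausible is to print what the installed PyTorch was built against. This is a sketch using standard introspection attributes (torch.cuda.nccl.version() may not exist on very old builds such as the 2018 one in the traceback above):

```python
import torch

print(torch.__version__)          # the traceback above points at a build from 2018
print(torch.version.cuda)         # CUDA toolkit version PyTorch was compiled against
print(torch.cuda.is_available())  # False here would mean a driver/runtime problem
print(torch.cuda.device_count())  # should be 2 when CUDA_VISIBLE_DEVICES=1,2
print(torch.cuda.nccl.version())  # bundled NCCL version, if this helper exists in your build
```

If torch.version.cuda disagrees with the CUDA version your driver supports (as reported by nvidia-smi), reinstalling PyTorch together with a matching cudatoolkit, as suggested above, is the usual fix.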
