Installing `cudatoolkit=10.2` may remove this error.
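For reference, a minimal sketch of how that reinstall might look with conda; the exact package set and channel are assumptions and depend on your environment:

```shell
# Reinstall PyTorch built against CUDA 10.2 (assumes the official "pytorch" conda channel)
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
```

The installed `cudatoolkit` version must be compatible with the NVIDIA driver on the server, so check `nvidia-smi` before pinning a version.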
When I tried to train on a Linux server with two GPUs, I hit this error (both worker processes print the same traceback, interleaved):

```
return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:600, unhandled cuda error
```

The error comes from `dist_model = DistModule(mode)` in the `main` function of `train.py`. The launch command was:

```
CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node=2 --master_port=2333 tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml
```

I searched online, and some people said switching to a single GPU solved the problem, but doesn't that throw away the whole advantage of multi-GPU parallel training? I haven't found a solution yet. If anyone knows a fix, I would be very grateful for an answer.
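One diagnostic step worth trying before changing anything else (a sketch, not a confirmed fix for this issue): `NCCL_DEBUG=INFO` is a standard NCCL environment variable that makes NCCL log which underlying CUDA call failed, rather than the generic "unhandled cuda error". Re-running the same launch command with it set usually narrows the problem down:

```shell
# Ask NCCL to log the actual failing CUDA call instead of "unhandled cuda error"
export NCCL_DEBUG=INFO

# Same launch command as above, unchanged
CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch \
    --nproc_per_node=2 --master_port=2333 \
    tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml
```

If the log shows all workers touching the same device, make sure each process calls `torch.cuda.set_device()` with its own local rank before any collective (such as the broadcast inside `DistModule`) runs; multiple ranks landing on one GPU is a common cause of this class of NCCL failure.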