the libfabric EFA provider is operating in a condition that could result in memory corruption or other system errors. #63
Comments
Maybe the version of PyTorch or CUDA is incorrect.
The PyTorch version is 1.13 and CUDA is 11.7, which match.
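Is this multi-GPU training? For multi-GPU training, the GPUs-per-node count in dist_utils.py has to be changed to your own number of GPUs, and the 4 in mpiexec -n 4 on the command line also has to be replaced with your number of GPUs.
No, single GPU. I'm not even using the mpiexec -n command.
Try adding the environment variable RDMAV_FORK_SAFE and see; it may be that, for safety, forking child processes directly is not allowed.
OK, I'll try it later.
I added it in the cm.train file, but it still doesn't work; the same error is reported.
Add it in /etc/profile, as a system environment variable.
Ah, OK.
Remember to refresh it with source after saving.
OK, thanks.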
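For anyone who prefers to set RDMAV_FORK_SAFE from Python rather than in /etc/profile, here is a minimal sketch. It rests on several assumptions not confirmed in this thread: that the value "1" is what the provider expects, that the variable is read once when the native libraries behind torch or mpi4py are initialized, and that setting it too late is why the attempt inside the training script had no effect. The helper name enable_fork_safe is hypothetical and not part of the repository.

```python
import os
import sys


def enable_fork_safe(env=os.environ, modules=sys.modules):
    """Set RDMAV_FORK_SAFE if it is not already set.

    Returns the names of libraries that were already imported when this ran.
    A non-empty result means the variable may have been set too late to take
    effect (an assumption about when the native libraries read it).
    """
    # Value "1" is an assumption; an existing value is left untouched.
    env.setdefault("RDMAV_FORK_SAFE", "1")
    # Heuristic check: these are the libraries assumed to pull in the
    # networking stack that prints the EFA warning.
    return [name for name in ("torch", "mpi4py") if name in modules]


if __name__ == "__main__":
    loaded_too_early = enable_fork_safe()
    if loaded_too_early:
        print("Already imported before setting the variable:", loaded_too_early)
    # Import torch / mpi4py only after this point.
```

Calling this at the very top of the entry script, before any other imports, is the intended usage. Whether it removes the warning in this setup is not verified here; the /etc/profile route suggested above sets the variable for every new login shell instead.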
When using CT mode for training, the following errors occur. Does anyone know how to solve them?