Two servers with 4 GPUs each: training fails #115
Comments
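This is not an error. It is most likely waiting for 10.10.11.51 to respond. What commands did you run on the two machines? Could you paste them?
TRAINER_IP_LIST=10.10.11.50,10.10.11.51
Are these two machines in the same cluster environment? Have you run multi-machine training jobs before? The configuration looks fine. Could it be a network connectivity problem? Are these IP addresses the actual addresses in your environment?
The network is reachable. I have not run multi-machine training jobs before.
Are you sure you ran the launch command above on each of the two machines? For multi-machine training, the launch command has to be executed on every machine.
Oh, I see. I'll give it a try.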
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
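For anyone hitting the same log: every endpoint listed as not ready is on 10.10.11.51, which fits the explanation that the trainers on that machine had not been started yet. Below is a rough standalone sketch for checking this by hand. It is not the launcher's own code. The helper names `build_endpoints` and `not_ready_endpoints` are made up for illustration, and the base port 6070 with 4 ports per machine is an assumption read off the log above.

```python
import socket


def build_endpoints(ip_list, base_port=6070, procs_per_node=4):
    """Expand a comma-separated IP list into host:port strings.

    base_port and procs_per_node are assumptions taken from the log
    (ports 6070-6073, 4 GPUs per machine), not from the launcher itself.
    """
    ips = [ip.strip() for ip in ip_list.split(",") if ip.strip()]
    return [f"{ip}:{base_port + i}" for ip in ips for i in range(procs_per_node)]


def not_ready_endpoints(endpoints, timeout=1.0):
    """Return the endpoints that do not accept a TCP connection."""
    pending = []
    for endpoint in endpoints:
        host, port = endpoint.rsplit(":", 1)
        try:
            # A successful connect means some process is listening there.
            with socket.create_connection((host, int(port)), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            pending.append(endpoint)
    return pending
```

Calling `not_ready_endpoints(build_endpoints("10.10.11.50,10.10.11.51"))` on either machine while training is starting up should show which side has nothing listening. If all the entries for one IP stay in the list, the launch command was probably not run on that machine, or a firewall is blocking those ports.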