Assign new id when we restart a task? #2752
Comments
I have thought about this question before too; it is just a
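It is reasonable for each task to have a unique id, and that id should not change when the task is replayed. That way, after the job finishes, we can tell how many times each task ran and which tasks never ran. For ways to generate a unique id, see the UUID RFC: https://stackoverflow.com/questions/1785503/when-should-i-use-uuid-uuid1-vs-uuid-uuid4-in-python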
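As a rough illustration of the point above, here is a minimal Python sketch, not the project's actual implementation. The `Task` class, the `summarize` helper, and the dispatch log are hypothetical names introduced only for this example. It assumes the master generates one `uuid.uuid4()` id per task at creation time and records that id every time the task is handed to a trainer.

```python
import uuid
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: the id is generated once and never changes."""
    chunk: str
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def summarize(tasks, dispatch_log):
    """Count runs per task id and list the ids that were never dispatched."""
    runs = Counter(dispatch_log)
    never_run = [t.task_id for t in tasks if runs[t.task_id] == 0]
    return runs, never_run


# Three tasks; the first one is dispatched twice (a replay keeps the same id),
# the second once, and the third never.
tasks = [Task("chunk-0"), Task("chunk-1"), Task("chunk-2")]
dispatch_log = [tasks[0].task_id, tasks[1].task_id, tasks[0].task_id]
runs, never_run = summarize(tasks, dispatch_log)
```

Because the id survives the replay, `runs` attributes both dispatches of the first task to the same key. If a new id were assigned on every restart, this per-task count could not be recovered from the log alone.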
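There is a conflict. Some reports say success, some say failure, and the order of the reports can change as well. How should that be handled?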
Agree with @typhoonzero. From @gongweibao:
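Does the conflict here mean the following? Task-a is fetched by one Trainer and training fails, but the failure report keeps failing because of network problems. After the master's wait times out, it puts Task-a back into the TODO queue, and another Trainer fetches and trains it. At that point the first Trainer is able to report the task status again, so Task-a would end up in both the Pending queue and the Failed queue. If so, a unique ID seems even more necessary. When a Trainer reports, the master could check whether the task is in the Pending queue; if it is, discard the report, and the Trainer simply fetches a new task to train?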
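The scenario described above can be sketched as follows. This is only a sketch under stated assumptions, not the actual master code: the `Master` class, its method names, and the trainer names are hypothetical. It additionally assumes the Pending queue remembers which Trainer currently owns each task, so that a late report from a previous owner can be recognized and dropped.

```python
import uuid


class Master:
    """Hypothetical in-memory master with TODO, Pending, Done and Failed queues."""

    def __init__(self, chunks):
        # Each task gets one id at creation; the id is kept across re-dispatches.
        self.todo = [(str(uuid.uuid4()), chunk) for chunk in chunks]
        self.pending = {}  # task_id -> (chunk, owning trainer)
        self.done = {}     # task_id -> chunk
        self.failed = {}   # task_id -> chunk

    def get_task(self, trainer):
        """Move the next TODO task to Pending and hand it to the trainer."""
        if not self.todo:
            return None
        task_id, chunk = self.todo.pop(0)
        self.pending[task_id] = (chunk, trainer)
        return task_id, chunk

    def on_timeout(self, task_id):
        """The owner did not report in time: requeue the task under the same id."""
        chunk, _ = self.pending.pop(task_id)
        self.todo.append((task_id, chunk))

    def report(self, trainer, task_id, ok):
        """Accept a report only from the current owner; drop stale reports."""
        entry = self.pending.get(task_id)
        if entry is None or entry[1] != trainer:
            return False  # stale report: the task was requeued or already settled
        chunk, _ = self.pending.pop(task_id)
        (self.done if ok else self.failed)[task_id] = chunk
        return True


# Replay of the Task-a scenario: trainer-1 times out, trainer-2 takes over,
# then trainer-1's delayed failure report finally arrives.
master = Master(["Task-a"])
task_id, _ = master.get_task("trainer-1")
master.on_timeout(task_id)
retry_id, _ = master.get_task("trainer-2")
late_report_accepted = master.report("trainer-1", task_id, ok=False)
final_report_accepted = master.report("trainer-2", task_id, ok=True)
```

With the stable id and the owner check, the delayed failure report is ignored, so Task-a never appears in Pending and Failed at the same time.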
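I have thought this over. First, we need to be clear that the figure captioned "The life cycle of a single task is illustrated below" is, strictly speaking, problematic. It can only show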
If we simplify:
Let's consider implementing the simple version first.
Great discussion! I lean toward the points raised under "If we simplify".
#2719 (comment)