Warn on resource deadlock; improve object store error messages #5555
Conversation
@@ -336,6 +336,7 @@ void NodeManager::Heartbeat() {
      static_cast<int64_t>(now_ms - last_debug_dump_at_ms_) > debug_dump_period_) {
    DumpDebugState();
    RecordMetrics();
    WarnResourceDeadlock();
Let's call this in DispatchTasks instead, and only call it if we weren't able to dispatch any tasks. Otherwise, we'll end up pushing an error every heartbeat for as long as the deadlock is happening (which is probably forever).
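For illustration, the control flow being suggested here might look roughly like the standalone sketch below. The names (MiniScheduler, ready_queue_, WarnResourceDeadlock) and the dispatch logic are placeholders for this sketch, not the actual raylet code.

#include <cstddef>
#include <iostream>
#include <list>

struct Task {
  int required_cpus;
};

class MiniScheduler {
 public:
  explicit MiniScheduler(int available_cpus) : available_cpus_(available_cpus) {}

  void Submit(Task task) { ready_queue_.push_back(task); }

  void DispatchTasks() {
    std::size_t dispatched = 0;
    for (auto it = ready_queue_.begin(); it != ready_queue_.end();) {
      if (it->required_cpus <= available_cpus_) {
        available_cpus_ -= it->required_cpus;
        it = ready_queue_.erase(it);
        ++dispatched;
      } else {
        ++it;
      }
    }
    // The suggestion above: warn only when a dispatch pass makes no progress
    // while tasks are still waiting, instead of on every heartbeat.
    if (dispatched == 0 && !ready_queue_.empty()) {
      WarnResourceDeadlock();
    }
  }

 private:
  void WarnResourceDeadlock() {
    std::cerr << "WARNING: " << ready_queue_.size()
              << " task(s) are waiting but no resources are available."
              << std::endl;
  }

  std::list<Task> ready_queue_;
  int available_cpus_;
};

int main() {
  MiniScheduler scheduler(/*available_cpus=*/0);  // e.g. all CPUs held by actors
  scheduler.Submit(Task{1});
  scheduler.DispatchTasks();  // Nothing can be dispatched, so the warning fires.
  return 0;
}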
I tried doing this, but it ended up printing too many false positives. The issue is that you only want to fire the warning after a significant delay, and right after DispatchTasks is too early: if you wait even a tiny bit of time, resources could immediately free up, e.g. if a task returns right after creating an actor.
I think printing forever is fine -- it is deadlocked, after all.
Hmm, maybe I'm misunderstanding how DispatchTasks works, but if DispatchTasks doesn't succeed in scheduling anything, and the conditions that you check are true (no running tasks), then doesn't that mean it'll never succeed again? I think when this happens, it can only be because there are no available workers or because all the cores are taken up by actors. We can make sure it's the second case if we also check that there are no resources available.
Not necessarily; a block/unblock could free up resources. Here is the example I was testing:

import ray

@ray.remote(num_cpus=1)
class A:
    def f(self): pass

@ray.remote
def f():
    a = A.remote()
    b = A.remote()
    c = A.remote()
    print("get 1")
    ray.get(a.f.remote())  # Blocking here releases f's CPU, so b can be placed.
    print("get 2")
    ray.get(b.f.remote())
    print("get 3")
    ray.get(c.f.remote())  # Genuinely deadlocks: a and b hold both CPUs forever.

ray.init(num_cpus=2)
ray.get(f.remote())
  // Progress is being made, don't warn.
  return;
}
Maybe add an additional check that there are no resources available on the local node. The RUNNING queue can also be empty if there are no workers available (because they haven't started yet or have all died).
I thought about that, but it could be that the actor is not placeable because of the size of its resource requirements. We could try checking for that too, but it seems simpler to rely on the periodic delay to keep the warning from firing too often.
Why are these changes needed?
The detection of resource deadlock is somewhat subtle. We periodically check whether there are no actively running tasks. If that is the case, and there are also tasks queued on this node in the READY state, then those tasks are likely to be queued indefinitely, since the resources on the local node must be occupied by actors that hold them for their lifetime.
There may be some false positives, e.g. if the actors are about to be destroyed or the cluster is scaling up, so this also changes the messages from ERROR to WARNING level.
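As a rough illustration of this heuristic, here is a standalone toy with made-up names and warning period (not the actual NodeManager code); the periodic check amounts to:

#include <chrono>
#include <cstddef>
#include <iostream>

class DeadlockMonitor {
 public:
  // Meant to be called from a periodic heartbeat.
  void MaybeWarn(std::size_t num_running, std::size_t num_ready) {
    if (num_running > 0 || num_ready == 0) {
      return;  // Progress is being made, or nothing is waiting; don't warn.
    }
    const auto now = std::chrono::steady_clock::now();
    if (warned_before_ && now - last_warning_at_ < warning_period_) {
      return;  // Throttle so the warning doesn't fire on every heartbeat.
    }
    warned_before_ = true;
    last_warning_at_ = now;
    // WARNING rather than ERROR, since this can be a false positive (actors
    // about to exit, cluster scaling up, etc.).
    std::cerr << "WARNING: " << num_ready
              << " task(s) are READY but nothing is RUNNING; local resources "
              << "may be occupied by actors." << std::endl;
  }

 private:
  bool warned_before_ = false;
  std::chrono::steady_clock::time_point last_warning_at_{};
  std::chrono::seconds warning_period_{60};
};

int main() {
  DeadlockMonitor monitor;
  monitor.MaybeWarn(/*num_running=*/0, /*num_ready=*/2);  // Fires the warning.
  return 0;
}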
Related issue number
Closes #5468
Linter
I've run scripts/format.sh to lint the changes in this PR.