
Warn on resource deadlock; improve object store error messages #5555

Merged
ericl merged 8 commits into ray-project:master on Aug 30, 2019

Conversation

@ericl (Contributor) commented Aug 28, 2019

Why are these changes needed?

  • Improve several object store error messages
  • Warn if the node is full of actors and hence pending tasks or actors cannot be executed ("resource deadlock")

The detection of resource deadlock is somewhat subtle. We periodically check whether there are any actively running tasks. If there are none, and there are also tasks queued on this node in the READY state, then those tasks are likely to remain queued indefinitely, since the local node's resources must be occupied by actors, which hold their resources for their lifetime.

There might be some false positives, e.g., if the actors are going to be destroyed soon or the cluster is scaling up, so the messages are also changed from ERROR to WARNING level.
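
The check itself is implemented in the C++ raylet (WarnResourceDeadlock in node_manager.cc). The following is only a rough Python sketch of the heuristic for illustration; the names used here (running_tasks, ready_queue, push_warning, WARN_PERIOD_S) are hypothetical, not the actual raylet interfaces.

import time

WARN_PERIOD_S = 10.0   # assumed minimum interval between warnings
_last_warning_s = 0.0

def maybe_warn_resource_deadlock(running_tasks, ready_queue, push_warning):
    """Called periodically; warns if queued tasks look permanently starved."""
    global _last_warning_s

    if running_tasks:
        # Progress is being made, don't warn.
        return
    if not ready_queue:
        # Nothing is waiting for resources.
        return
    if time.time() - _last_warning_s < WARN_PERIOD_S:
        # Rely on the periodic delay so the warning does not fire too often.
        return
    _last_warning_s = time.time()

    # No task is running, yet tasks are READY on this node: local resources
    # are most likely held by actors (which keep them for their lifetime),
    # so the queued work may never be scheduled. Push a WARNING rather than
    # an ERROR, since the situation can still resolve itself (actors exiting,
    # the cluster scaling up, etc.).
    push_warning("Pending task {} may never be scheduled: local resources "
                 "appear to be fully occupied by actors.".format(ready_queue[0]))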

Related issue number

Closes #5468

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16582/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16581/

@ericl changed the title from "[WIP] improve object store error messages" to "Warn on resource deadlock; improve object store error messages" on Aug 30, 2019
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16653/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16654/

@@ -336,6 +336,7 @@ void NodeManager::Heartbeat() {
static_cast<int64_t>(now_ms - last_debug_dump_at_ms_) > debug_dump_period_) {
DumpDebugState();
RecordMetrics();
WarnResourceDeadlock();
Contributor

Let's call this in DispatchTasks instead, and only call it if we weren't able to dispatch any tasks. Otherwise, we'll end up pushing an error every heartbeat for as long as the deadlock is happening (which is probably forever).

Contributor Author

I tried doing this, but it ended up printing too many false positives. The issue is that you only want to fire the warning after a significant delay, and the moment right after DispatchTasks is too early: if you wait even a little bit, resources can free up, e.g. when a task returns right after creating an actor.

I think printing forever is fine -- it is deadlocked after all.

Contributor

Hmm, maybe I'm misunderstanding how DispatchTasks works, but if DispatchTasks doesn't succeed in scheduling anything, and the conditions that you check are true (no running tasks), then doesn't that mean it will never succeed again? I think when this happens, it can only be because there are no available workers or because all the cores are taken up by actors. We can make sure it's the second case if we also check that there are no resources available.

Contributor Author

Not necessarily; a block/unblock could free up resources. Here is the example I was testing:

import ray

# Each actor instance holds one CPU for its whole lifetime.
@ray.remote(num_cpus=1)
class A:
    def f(self):
        pass


@ray.remote
def f():
    # f() itself occupies one of the two CPUs, so only one actor can be
    # placed at first; the other two stay pending.
    a = A.remote()
    b = A.remote()
    c = A.remote()
    # Each ray.get() blocks f() and releases its CPU while it waits, so the
    # remaining actors can eventually be placed and the script makes
    # progress; an instantaneous "nothing was dispatched" check here would
    # be a false positive.
    print("get 1")
    ray.get(a.f.remote())
    print("get 2")
    ray.get(b.f.remote())
    print("get 3")
    ray.get(c.f.remote())


ray.init(num_cpus=2)
ray.get(f.remote())

@ericl added and removed the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Aug 30, 2019
// Progress is being made, don't warn.
return;
}

Contributor

Maybe add an additional check that there are no resources available on the local node. The RUNNING queue can also be empty if there are no workers available (because they haven't started yet or have all died).

Contributor Author

I thought about that, but it could be that the actor is not placeable due to the size of its resource request. We could try checking for that too, but it seems simpler to rely on the periodic delay to avoid the warning firing too often.
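
For reference, the additional check suggested above would look roughly like the following, in the same illustrative Python style as the sketch in the description (available_resources is a hypothetical mapping from resource name to free quantity, not the actual raylet data structure):

def local_resources_exhausted(available_resources):
    # Suggested extra condition: only flag a likely deadlock when the node
    # has nothing left to hand out, i.e. every resource (CPU, GPU, custom)
    # is fully claimed.
    # Caveat from the reply above: a large request (e.g. an actor asking for
    # 2 CPUs when only 1 is free) can still be stuck even if this is False.
    return all(quantity == 0 for quantity in available_resources.values())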

@ericl merged commit 3e70dab into ray-project:master on Aug 30, 2019
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16668/


Successfully merging this pull request may close these issues:

  • Hang without any error message
3 participants