
Warn on resource deadlock; improve object store error messages #5555

Merged
ericl merged 8 commits into ray-project:master on Aug 30, 2019

Conversation

@ericl (Contributor) commented Aug 28, 2019

Why are these changes needed?

  • Improve several object store error messages
  • Warn if the node is full of actors and hence pending tasks or actors cannot be executed ("resource deadlock")

The detection of resource deadlock is somewhat subtle. We periodically check whether there are any actively running tasks. If there are none, and there are also tasks queued on this node in the READY state, then those tasks are likely to remain queued indefinitely, since the local node's resources must be occupied by actors, which hold their resources for their lifetime.

There might be some false positives, e.g., if the actors are going to be destroyed soon or the cluster is scaling up, so the messages are also changed from ERROR to WARNING level.
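
The check itself is implemented in the C++ raylet (WarnResourceDeadlock in node_manager.cc). The following is only a rough Python sketch of the heuristic for illustration; the names used here (running_tasks, ready_queue, push_warning, WARN_PERIOD_S) are hypothetical, not the actual raylet interfaces.

import time

WARN_PERIOD_S = 10.0   # assumed minimum interval between warnings
_last_warning_s = 0.0

def maybe_warn_resource_deadlock(running_tasks, ready_queue, push_warning):
    """Called periodically; warns if queued tasks look permanently starved."""
    global _last_warning_s

    if running_tasks:
        # Progress is being made, don't warn.
        return
    if not ready_queue:
        # Nothing is waiting for resources.
        return
    if time.time() - _last_warning_s < WARN_PERIOD_S:
        # Rely on the periodic delay so the warning does not fire too often.
        return
    _last_warning_s = time.time()

    # No task is running, yet tasks are READY on this node: local resources
    # are most likely held by actors (which keep them for their lifetime),
    # so the queued work may never be scheduled. Push a WARNING rather than
    # an ERROR, since the situation can still resolve itself (actors exiting,
    # the cluster scaling up, etc.).
    push_warning("Pending task {} may never be scheduled: local resources "
                 "appear to be fully occupied by actors.".format(ready_queue[0]))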

Related issue number

Closes #5468

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16582/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16581/

@ericl changed the title from "[WIP] improve object store error messages" to "Warn on resource deadlock; improve object store error messages" on Aug 30, 2019
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16653/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16654/

@@ -336,6 +336,7 @@ void NodeManager::Heartbeat() {
static_cast<int64_t>(now_ms - last_debug_dump_at_ms_) > debug_dump_period_) {
DumpDebugState();
RecordMetrics();
WarnResourceDeadlock();
Contributor

Let's call this in DispatchTasks instead, and only call it if we weren't able to dispatch any tasks. Otherwise, we'll end up pushing an error every heartbeat for as long as the deadlock is happening (which is probably forever).

Contributor Author

I tried doing this, but it ended up printing too many false positives. The issue is that you only want to fire the warning after a significant delay, and the moment right after DispatchTasks is too early: if you wait even a little bit, resources can free up, e.g. when a task returns right after creating an actor.

I think printing forever is fine -- it is deadlocked after all.

Contributor

Hmm, maybe I'm misunderstanding how DispatchTasks works, but if DispatchTasks doesn't succeed in scheduling anything, and the conditions that you check are true (no running tasks), then doesn't that mean it will never succeed again? I think when this happens, it can only be because there are no available workers or because all the cores are taken up by actors. We can make sure it's the second case if we also check that there are no resources available.

Contributor Author

Not necessarily; a block/unblock could free up resources. Here is the example I was testing:

import ray

# Each actor instance holds one CPU for its whole lifetime.
@ray.remote(num_cpus=1)
class A:
    def f(self):
        pass


@ray.remote
def f():
    # f() itself occupies one of the two CPUs, so only one actor can be
    # placed at first; the other two stay pending.
    a = A.remote()
    b = A.remote()
    c = A.remote()
    # Each ray.get() blocks f() and releases its CPU while it waits, so the
    # remaining actors can eventually be placed and the script makes
    # progress; an instantaneous "nothing was dispatched" check here would
    # be a false positive.
    print("get 1")
    ray.get(a.f.remote())
    print("get 2")
    ray.get(b.f.remote())
    print("get 3")
    ray.get(c.f.remote())


ray.init(num_cpus=2)
ray.get(f.remote())

@ericl added and removed the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Aug 30, 2019
// Progress is being made, don't warn.
return;
}

Contributor

Maybe add an additional check that there are no resources available on the local node. The RUNNING queue can also be empty if there are no workers available (because they haven't started yet or have all died).

Contributor Author

I thought about that, but it could be that the actor is not placeable due to the size of its resource request. We could try checking for that too, but it seems simpler to rely on the periodic delay to avoid the warning firing too often.
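
For reference, the additional check suggested above would look roughly like the following, in the same illustrative Python style as the sketch in the description (available_resources is a hypothetical mapping from resource name to free quantity, not the actual raylet data structure):

def local_resources_exhausted(available_resources):
    # Suggested extra condition: only flag a likely deadlock when the node
    # has nothing left to hand out, i.e. every resource (CPU, GPU, custom)
    # is fully claimed.
    # Caveat from the reply above: a large request (e.g. an actor asking for
    # 2 CPUs when only 1 is free) can still be stuck even if this is False.
    return all(quantity == 0 for quantity in available_resources.values())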

@ericl merged commit 3e70dab into ray-project:master on Aug 30, 2019
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16668/


Successfully merging this pull request may close these issues:

  • Hang without any error message
3 participants