[Core/RuntimeEnv] Fix runtime environment hanging issues. #19823

rkooo567 · 2021-10-28T03:34:00Z

Why are these changes needed?

The current runtime environment handling has several issues that cause scheduling to hang. This PR addresses all of them at once.

This PR fixes three things:

1. When a job starts, it creates the runtime env even though an empty runtime env was given.
2. When the agent is permanently dead (after 5 retries with exponential backoff) or the minimal deps are not provided, scheduling hangs forever.
   - We now properly call the callback in this case.
3. An actor task hangs forever if the actor fails to start due to the runtime env.
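
The first two items above can be illustrated with a minimal Python sketch. This is not the code from this PR: `create_runtime_env`, `try_create`, `on_done`, and the `base_delay_s` default are hypothetical names and values chosen for illustration. Only the retry count (5) and the use of exponential backoff come from the description. The point it shows is that every path, including the empty runtime env and the permanent failure, ends by invoking the callback so the caller never waits forever.

```python
import time
from typing import Callable, Optional


def create_runtime_env(
    runtime_env: dict,
    try_create: Callable[[dict], bool],
    on_done: Callable[[bool, Optional[str]], None],
    max_retries: int = 5,
    base_delay_s: float = 0.1,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Hypothetical helper: every path calls on_done exactly once."""
    # An empty runtime env needs no setup, so report success right away
    # instead of creating anything.
    if not runtime_env:
        on_done(True, None)
        return

    for attempt in range(max_retries):
        if try_create(runtime_env):
            on_done(True, None)
            return
        # Exponential backoff between attempts: base, 2*base, 4*base, ...
        if attempt < max_retries - 1:
            sleep(base_delay_s * (2 ** attempt))

    # Retries are exhausted: report the failure rather than returning
    # silently, which would leave the caller hanging.
    on_done(False, f"runtime env setup failed after {max_retries} attempts")
```

Passing `sleep` as a parameter is only a convenience so the backoff can be observed without real delays.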

Note that we still raise RayActorError, which doesn't provide a good error message. I will create a separate issue (#19824) to propagate runtime env related errors to the driver, and fix it in a separate PR to minimize the changes within this one.

Related issue number

Closes #19514 #17558

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

scv119 · 2021-10-28T07:46:43Z

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

@@ -487,6 +487,23 @@ void RayletBasedActorScheduler::HandleWorkerLeaseReply(
    }

    if (status.ok()) {
+      // The runtime environment could be created. It means this actor cannot be created.


could or could not be created?

Oh, I forgot to address it. It should be "could not be created". Will fix.

Mentioned;

// The runtime environment has failed by an unrecoverable error.
// We cannot create this actor anymore.

scv119 · 2021-10-28T07:53:43Z

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

@@ -487,6 +487,23 @@ void RayletBasedActorScheduler::HandleWorkerLeaseReply(
    }

    if (status.ok()) {
+      // The runtime environment could be created. It means this actor cannot be created.
+      if (reply.runtime_env_setup_failed()) {


is runtime_env_setup_failure non recoverable?

I think we reply only upon unrecoverable failures (and if it is recoverable, it can keep retrying). I am not sure this actually holds on every code path, but I've seen some parts that don't reply and instead retry within the worker pool.

For the pattern itself, I am following the existing task code path (if this reply is received, mark the task as failed).
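
To make the pattern concrete, here is a rough Python sketch of the decision described above. It is not the C++ logic in `gcs_actor_scheduler.cc`: `LeaseReply`, `ActorState`, `worker_granted`, and `handle_worker_lease_reply` are hypothetical stand-ins, and only the `runtime_env_setup_failed` flag mirrors the name in the diff. It assumes, as the comment says, that the flag is only set for unrecoverable failures.

```python
from dataclasses import dataclass
from enum import Enum


class ActorState(Enum):
    PENDING = "PENDING"  # keep waiting or retry the lease
    ALIVE = "ALIVE"      # a worker was granted
    DEAD = "DEAD"        # creation failed permanently


@dataclass
class LeaseReply:
    # Hypothetical stand-in for the worker lease reply; only the
    # runtime_env_setup_failed flag corresponds to the diff.
    runtime_env_setup_failed: bool = False
    worker_granted: bool = False


def handle_worker_lease_reply(reply: LeaseReply) -> ActorState:
    # Assumption: the flag is only set on unrecoverable failures, so the
    # actor is marked as failed instead of being rescheduled.
    if reply.runtime_env_setup_failed:
        return ActorState.DEAD
    if reply.worker_granted:
        return ActorState.ALIVE
    return ActorState.PENDING
```

Checking the failure flag first means a reply that reports a failed runtime env never leaves the actor pending, which is the hang this PR targets.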

rkooo567 · 2021-10-28T15:20:05Z

The Mac test failures are transient. I will re-run them after getting initial feedback.

architkulkarni

This looks good to me, thanks for the fix!

edoakes

This is fantastic, thanks so much for the fixes!

python/ray/tests/test_runtime_env.py

rkooo567 · 2021-10-29T14:01:52Z

The failed tests (Windows), test_cancel and bare_metal_policy_with_custom_view_reqs, are both flaky on master.

carsonwang · 2021-11-09T02:33:28Z

Is there any plan to release Ray 1.8.1? It would be great to include this fix. Right now we can't make RayDP work with Ray 1.8.0 because of this issue.

rkooo567 · 2021-11-09T02:51:51Z

Cc @scv119 @ericl maybe we should have a .1 release?

rkooo567 added 2 commits October 27, 2021 20:11

done

0706bed

Add a right test

0b474c5

rkooo567 assigned scv119, architkulkarni and edoakes Oct 28, 2021

rkooo567 mentioned this pull request Oct 28, 2021

[Bug] Improve RuntimeEnvSetupError message #19824

Open

2 tasks

rkooo567 added 2 commits October 28, 2021 00:24

Merge branch 'master' into runtime-env-fixes

7ba30eb

Fix unit tests

16e8d5a

scv119 reviewed Oct 28, 2021


architkulkarni approved these changes Oct 28, 2021


edoakes approved these changes Oct 28, 2021



scv119 approved these changes Oct 29, 2021


Merge branch 'master' into runtime-env-fixes

c584bb2

rkooo567 added the @author-action-required label Oct 29, 2021

fix issues

5f70697

rkooo567 merged commit 16dcff4 into ray-project:master Oct 29, 2021

rkooo567 mentioned this pull request Nov 2, 2021

[Core] task scheduling hangs forever #19326

Closed

2 tasks

kira-lin mentioned this pull request Nov 8, 2021

[Ray Dataset] Update raydp test dependency #20142

Closed

6 tasks
