Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core/RuntimeEnv] Fix runtime environment hanging issues. #19823

Merged
merged 6 commits into from
Oct 29, 2021

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Oct 28, 2021

Why are these changes needed?

There are several issues with the current runtime environment that hangs scheduling. This PR handles all issues at once.

This is fixing 3 things

  • When jobs are started, it creates the runtime env although the empty runtime env was given.
  • When the agent is permanently dead by 5 times retrying with exponential backoff) or the minimal deps are not provided, it hangs forever.
    • We properly call callback in this case now.
  • The actor task is hanging forever if the actor is failed to be started due to the runtime env

Note that we are still raising RayActorError which doesn't provide the good error message. I will create an issue to propagate runtime env related errors to the driver separately and fix issues (#19824) in a separate PR so that I can minimize the change within one PR.

Related issue number

Closes #19514 #17558

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -487,6 +487,23 @@ void RayletBasedActorScheduler::HandleWorkerLeaseReply(
}

if (status.ok()) {
// The runtime environment could be created. It means this actor cannot be created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could or could not be created?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I forgot to address it. could not be created. will fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned;

      // The runtime environment has failed by an unrecoverable error.
      // We cannot create this actor anymore.

@@ -487,6 +487,23 @@ void RayletBasedActorScheduler::HandleWorkerLeaseReply(
}

if (status.ok()) {
// The runtime environment could be created. It means this actor cannot be created.
if (reply.runtime_env_setup_failed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is runtime_env_setup_failure non recoverable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are replying only upon unrecoverable failures (and if it is, it can keep retyring). I am not sure if it is actually true in the code path, but I've seen some parts that don't reply but retry within worker pool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the pattern itself, I am following the existing task code path (if this reply is received, mark the task as failed)

@rkooo567
Copy link
Contributor Author

mac test failures are transient. I will re-run it after getting initial feedback

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks for the fix!

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic, thanks so much for the fixes!

python/ray/tests/test_runtime_env.py Show resolved Hide resolved
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 29, 2021
@rkooo567
Copy link
Contributor Author

failed test (Windows) test_cancel and bare_metal_policy_with_custom_view_reqs are both flaky in the master

@rkooo567 rkooo567 merged commit 16dcff4 into ray-project:master Oct 29, 2021
@carsonwang
Copy link

Is there any plan to release Ray 1.8.1? It will be great to include this fix. Right now we can't make RayDP work with Ray 1.8.0 because of this issue.

@rkooo567
Copy link
Contributor Author

rkooo567 commented Nov 9, 2021

Cc @scv119 @ericl maybe we should have a .1 release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[runtime_env] Actor tasks aren't failed correctly when missing runtime_env dependencies
5 participants