[Core/RuntimeEnv] Fix runtime environment hanging issues. #19823
@@ -487,6 +487,23 @@ void RayletBasedActorScheduler::HandleWorkerLeaseReply(
  }

  if (status.ok()) {
    // The runtime environment could be created. It means this actor cannot be created.
could or could not be created?
Oh, I forgot to address it. It should say "could not be created". Will fix.
Mentioned;
// The runtime environment has failed by an unrecoverable error.
// We cannot create this actor anymore.
@@ -487,6 +487,23 @@ void RayletBasedActorScheduler::HandleWorkerLeaseReply(
  }

  if (status.ok()) {
    // The runtime environment could be created. It means this actor cannot be created.
    if (reply.runtime_env_setup_failed()) {
is runtime_env_setup_failure non recoverable?
I think we reply only upon unrecoverable failures (if the failure is recoverable, it can keep retrying). I am not sure that is actually true in this code path, but I've seen some parts that don't reply and instead retry within the worker pool.
For the pattern itself, I am following the existing task code path (if this reply is received, mark the task as failed)
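To make the pattern concrete, here is a minimal, self-contained sketch of the decision being discussed. It is not the code in this PR: `Status`, `LeaseReply`, `LeaseOutcome`, and `ClassifyLeaseReply` are hypothetical stand-ins, and the retry branch for a failed RPC is an assumption, not something stated in this thread.

```cpp
// Hypothetical stand-ins for the RPC status and the lease reply.
// These are NOT Ray's actual types; they only mirror the two calls
// visible in the diff: status.ok() and reply.runtime_env_setup_failed().
struct Status {
  bool ok_flag;
  bool ok() const { return ok_flag; }
};

struct LeaseReply {
  bool setup_failed;
  bool runtime_env_setup_failed() const { return setup_failed; }
};

// Hypothetical outcome labels for illustration only.
enum class LeaseOutcome {
  kActorFailed,  // runtime env setup failed: give up on creating the actor
  kProceed,      // reply is fine: continue with normal lease handling
  kRetry         // RPC itself failed (assumed behavior: try leasing again)
};

LeaseOutcome ClassifyLeaseReply(const Status &status, const LeaseReply &reply) {
  if (status.ok()) {
    // The reply arrived. If it reports a runtime env setup failure, treat the
    // failure as unrecoverable and mark the actor as failed instead of
    // retrying, mirroring the task code path described above.
    if (reply.runtime_env_setup_failed()) {
      return LeaseOutcome::kActorFailed;
    }
    return LeaseOutcome::kProceed;
  }
  // Assumption: a failed RPC is handled by retrying the lease request.
  return LeaseOutcome::kRetry;
}
```

The point of the sketch is only that a runtime env setup failure is checked on a successful reply and short-circuits to a failure, so the scheduler does not keep waiting on an actor that can never start.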
The mac test failures are transient. I will re-run them after getting initial feedback.
This looks good to me, thanks for the fix!
This is fantastic, thanks so much for the fixes!
failed test
Is there any plan to release Ray 1.8.1? It would be great to include this fix. Right now we can't make RayDP work with Ray 1.8.0 because of this issue.
Why are these changes needed?
There are several issues with the current runtime environment that hang scheduling. This PR handles all of them at once.
This is fixing 3 things
Note that we are still raising RayActorError, which doesn't provide a good error message. I will create an issue to propagate runtime env related errors to the driver separately and fix those issues (#19824) in a separate PR so that I can minimize the change within one PR.
Related issue number
Closes #19514 #17558
Checks
I've run scripts/format.sh to lint the changes in this PR.