Reload failed actors from checkpoint #3818
Comments
I've already implemented this internally, mostly based on this design. I can move the code to GitHub soon. I hope you didn't spend much time on this. Sorry for not informing you beforehand.
@stephanie-wang sure, I'll start working on this today. Apologies for the inconvenience.
No problem, thanks! By the way, do you also have an implementation for reloading from a checkpoint without replaying the most recent tasks? If yes, it would be great if you could open a PR for that as well. Thank you!
With my current API, this can be done by making |
That's fine; we can implement other storage options in another PR.
Just to clarify, you mean |
@ujvl yeah, that's right.
I see. I think it would be great to have an API explicitly for reloading an actor from a checkpoint without replay. Also, it would be useful to be able to create a new actor from an old actor's checkpoint.
Agreed. I have thought about how to do that though. |
From a backend perspective, it seems like creating a new actor from an old actor's checkpoint would be pretty easy. I think it could actually be implemented entirely in the Python/Java client. It's not entirely clear to me how we can natively support reloading an existing actor without replay. If the actor only has one caller, then it seems pretty easy to do, but I'm not so sure about multiple callers. One proposal is that if an actor task gets resubmitted, then we just pretend that the task has been executed (e.g., by treating its dummy object return values as local), so that the next task can run immediately. Regardless, I think it's best that we start with @raulchen's suggestion of doing a mock checkpoint on every call so that we can implement and test the API, but it would be nice to remove the per-task overhead in the future.
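To make the "checkpoint on every call, then skip resubmitted tasks" idea concrete, here is a minimal pure-Python sketch. It does not use any Ray API: `NoReplayActor`, `submit`, `restart`, and the `task_index` argument are all hypothetical names made up for illustration, and it assumes a single caller that numbers its tasks in order (the multi-caller case raised above is not handled).

```python
import copy


class Accumulator:
    """Toy actor state used only for this illustration."""

    def __init__(self):
        self.items = []

    def append(self, x):
        self.items.append(x)
        return len(self.items)


class NoReplayActor:
    """Hypothetical wrapper (not a Ray API): checkpoints after every call
    and treats resubmitted tasks as already executed."""

    def __init__(self, factory):
        self.actor = factory()
        self.saved_state = copy.deepcopy(self.actor)
        self.last_task = -1   # index of the last task covered by the checkpoint
        self.executed = 0     # how many method bodies actually ran

    def submit(self, task_index, method, *args):
        if task_index <= self.last_task:
            # Resubmitted task: its effect is already in the checkpoint,
            # so pretend it ran and let the next task proceed.
            return None
        result = getattr(self.actor, method)(*args)
        self.executed += 1
        # "Mock checkpoint on every call": simple, but costs a copy per task.
        self.saved_state = copy.deepcopy(self.actor)
        self.last_task = task_index
        return result

    def restart(self):
        # Simulate a failure followed by a reload from the latest checkpoint.
        self.actor = copy.deepcopy(self.saved_state)


actor = NoReplayActor(Accumulator)
actor.submit(0, "append", "a")
actor.submit(1, "append", "b")
actor.restart()
skipped = [actor.submit(0, "append", "a"), actor.submit(1, "append", "b")]
count_after = actor.submit(2, "append", "c")
```

The deep copy on every call is exactly the per-task overhead mentioned above; a real implementation would presumably checkpoint less often and need some other way to decide which resubmitted tasks to skip.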
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for opening the issue!
Describe the problem
Ability to restart a failed actor from its most recent checkpoint and replay all methods since then.
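As a rough illustration of the requested behavior, the sketch below restores a failed actor from its most recent checkpoint and then replays the method calls logged since that checkpoint. It is plain Python rather than Ray code: `CheckpointedRunner`, `checkpoint_every`, `fail`, and `recover` are hypothetical names, and the checkpoint is just an in-memory deep copy rather than any real storage backend.

```python
import copy


class Counter:
    """Toy actor state: a running total."""

    def __init__(self):
        self.total = 0

    def add(self, n):
        self.total += n
        return self.total


class CheckpointedRunner:
    """Hypothetical wrapper (not a Ray API): logs calls and checkpoints
    every `checkpoint_every` calls."""

    def __init__(self, factory, checkpoint_every=3):
        self.checkpoint_every = checkpoint_every
        self.actor = factory()
        self.checkpoint = copy.deepcopy(self.actor)
        self.log = []  # calls made since the last checkpoint

    def call(self, method, *args):
        result = getattr(self.actor, method)(*args)
        self.log.append((method, args))
        if len(self.log) >= self.checkpoint_every:
            # Take a new checkpoint and drop the calls it now covers.
            self.checkpoint = copy.deepcopy(self.actor)
            self.log.clear()
        return result

    def fail(self):
        # Simulate the actor process dying: live state is lost.
        self.actor = None

    def recover(self):
        # Reload the latest checkpoint, then replay everything after it.
        self.actor = copy.deepcopy(self.checkpoint)
        for method, args in self.log:
            getattr(self.actor, method)(*args)


runner = CheckpointedRunner(Counter, checkpoint_every=3)
for n in range(1, 6):
    runner.call("add", n)
total_before_failure = runner.actor.total
pending_calls = len(runner.log)
runner.fail()
runner.recover()
```

Replay only reproduces the lost state if the replayed methods are deterministic, which this sketch simply assumes.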