Reload failed actors from checkpoint #3818

Closed

stephanie-wang opened this issue Jan 21, 2019 · 13 comments
Assignees
raulchen

Labels
enhancement Request for new feature and/or capability
stale The issue is stale. It will be closed within 7 days unless there is further conversation.

Comments

@stephanie-wang
Contributor

Describe the problem

Ability to restart a failed actor from its most recent checkpoint and replay all methods since then.
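For context, a minimal sketch of what this could look like from the application side. The should_checkpoint / save_checkpoint / load_checkpoint hooks follow the names used later in this thread; the exact interface, the checkpoint path, and the pickle-based serialization are illustrative assumptions, not a confirmed Ray API.

```python
import pickle

import ray


@ray.remote
class Counter:
    """Actor whose state can be checkpointed and later restored.

    On failure, the backend would restore the latest checkpoint via
    load_checkpoint and then replay all methods executed since that
    checkpoint (the feature requested in this issue).
    """

    def __init__(self, checkpoint_path="/tmp/counter.ckpt"):
        self.checkpoint_path = checkpoint_path
        self.value = 0
        self.tasks_since_checkpoint = 0

    def increment(self):
        self.value += 1
        self.tasks_since_checkpoint += 1
        return self.value

    # --- hypothetical checkpoint hooks ---------------------------------
    def should_checkpoint(self):
        # Take a checkpoint every 100 tasks; anything executed after the
        # last checkpoint would be replayed on restart.
        return self.tasks_since_checkpoint >= 100

    def save_checkpoint(self):
        with open(self.checkpoint_path, "wb") as f:
            pickle.dump(self.value, f)
        self.tasks_since_checkpoint = 0

    def load_checkpoint(self):
        with open(self.checkpoint_path, "rb") as f:
            self.value = pickle.load(f)
        self.tasks_since_checkpoint = 0
```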

@raulchen
Contributor

I've already implemented this internally, mostly based on this design. I can move the code to GitHub soon. Hope you didn't spend much time on this. Sorry for not informing you beforehand.

raulchen self-assigned this Jan 23, 2019
@stephanie-wang
Contributor Author

I see. @raulchen, can you open a PR ASAP? @ujvl was going to implement this in the backend, but maybe he can review instead.

@raulchen
Contributor

@stephanie-wang Sure, I'll start working on this today. Apologies for the inconvenience.

@stephanie-wang
Contributor Author

No problem, thanks!

By the way, do you also have an implementation for reloading from a checkpoint without replaying the most recent tasks? If yes, it would be great if you could open a PR for that as well. Thank you!

@raulchen
Contributor

reloading from a checkpoint without replaying the most recent tasks

With my current API, this can be done by making should_checkpoint return True for every task, but not actually saving the checkpoint to external storage in save_checkpoint. However, this adds a small overhead of saving the actor's state to the GCS for each task.
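For illustration, a rough sketch of this workaround, under the same assumed hooks as the sketch in the issue description: should_checkpoint reports a checkpoint after every task so a restarted actor has nothing to replay, while save_checkpoint only writes real checkpoint data occasionally. The class, path, and checkpoint interval are hypothetical.

```python
import pickle

import ray


@ray.remote
class ReplayFreeCounter:
    def __init__(self, checkpoint_path="/tmp/replay_free.ckpt"):
        self.checkpoint_path = checkpoint_path
        self.value = 0
        self.calls = 0

    def increment(self):
        self.value += 1
        self.calls += 1
        return self.value

    def should_checkpoint(self):
        # Report a checkpoint after every task, so nothing needs to be
        # replayed when the actor is restarted. The per-task cost noted
        # above is the backend saving the actor's state to the GCS.
        return True

    def save_checkpoint(self):
        # Only write real checkpoint data occasionally; the other calls
        # are no-ops, so there is no external-storage cost per task.
        if self.calls % 100 == 0:
            with open(self.checkpoint_path, "wb") as f:
                pickle.dump(self.value, f)

    def load_checkpoint(self):
        with open(self.checkpoint_path, "rb") as f:
            self.value = pickle.load(f)
```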

@raulchen
Contributor

BTW, I didn't implement checkpoint storage as @ujvl wanted to do in #3825. But I agree that could be useful for some users.

@ujvl
Contributor

ujvl commented Jan 24, 2019

That's fine; we can implement other storage options in another PR.

With my current API, this can be done by making should_checkpoint return True for every task, but don't really save the checkpoint to external storage in save_checkpoint.

Just to clarify, you mean save_checkpoint would still save checkpoint data on some calls, but would save actor state every call, right? I think there's probably a way to do it without that overhead, but it might be better to discuss that after looking at your PR.

@raulchen
Contributor

Just to clarify, you mean save_checkpoint would still save checkpoint data on some calls, but would save actor state every call, right?

@ujvl yeah, that's right.

@stephanie-wang
Contributor Author

I see. I think it would be great to have an API explicitly for reloading an actor from a checkpoint without replay. Also, it would be useful to be able to create a new actor from an old actor's checkpoint.

@raulchen
Contributor

I see. I think it would be great to have an API explicitly for reloading an actor from a checkpoint without replay. Also, it would be useful to be able to create a new actor from an old actor's checkpoint.

Agreed. I haven't thought about how to do that yet, though.

@stephanie-wang
Contributor Author

From a backend perspective, it seems like creating a new actor from an old actor's checkpoint would be pretty easy. I think it could actually be implemented entirely in the Python/Java client.

It's not entirely clear to me how we can natively support reloading an existing actor without replay. If the actor only has one caller, then it seems pretty easy to do, but I'm not so sure about multiple callers. One proposal is that if an actor task gets resubmitted, then we just pretend that the task has been executed (e.g., by treating its dummy object return values as local), so that the next task can run immediately.

Regardless, I think it's best that we start with @raulchen's suggestion of doing a mock checkpoint on every call so that we can implement and test the API, but it would be nice to remove the per-task overhead in the future.
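A hedged sketch of the client-side approach to "create a new actor from an old actor's checkpoint" described above, assuming a checkpoint path reachable by both actors (e.g., shared storage). The Worker class, its checkpoint method, and the path are illustrative names, not an existing Ray API.

```python
import pickle

import ray


@ray.remote
class Worker:
    def __init__(self, checkpoint_path=None):
        # If a checkpoint path is given, start from the old actor's state.
        self.state = {}
        if checkpoint_path is not None:
            with open(checkpoint_path, "rb") as f:
                self.state = pickle.load(f)

    def update(self, key, value):
        self.state[key] = value
        return self.state

    def checkpoint(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.state, f)
        return path


ray.init()

old = Worker.remote()
ray.get(old.update.remote("step", 42))
path = ray.get(old.checkpoint.remote("/tmp/worker.ckpt"))

# Later (even after the old actor has died), construct a brand-new actor
# from the old actor's checkpoint; no backend support is required.
new = Worker.remote(checkpoint_path=path)
```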

ericl added the enhancement label and removed the feature request label on Mar 5, 2020
@stale

stale bot commented Nov 14, 2020

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label on Nov 14, 2020
@stale

stale bot commented Nov 28, 2020

Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale bot closed this as completed on Nov 28, 2020