Reload failed actors from checkpoint #3818

Closed

stephanie-wang opened this issue Jan 21, 2019 · 13 comments
Assignees
raulchen

Labels
enhancement Request for new feature and/or capability
stale The issue is stale. It will be closed within 7 days unless there is further conversation.

Comments

@stephanie-wang
Contributor

Describe the problem

Ability to restart a failed actor from its most recent checkpoint and replay all methods since then.
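For context, a minimal sketch of what this could look like from the application side. The should_checkpoint / save_checkpoint / load_checkpoint hooks follow the names used later in this thread; the exact interface, the checkpoint path, and the pickle-based serialization are illustrative assumptions, not a confirmed Ray API.

```python
import pickle

import ray


@ray.remote
class Counter:
    """Actor whose state can be checkpointed and later restored.

    On failure, the backend would restore the latest checkpoint via
    load_checkpoint and then replay all methods executed since that
    checkpoint (the feature requested in this issue).
    """

    def __init__(self, checkpoint_path="/tmp/counter.ckpt"):
        self.checkpoint_path = checkpoint_path
        self.value = 0
        self.tasks_since_checkpoint = 0

    def increment(self):
        self.value += 1
        self.tasks_since_checkpoint += 1
        return self.value

    # --- hypothetical checkpoint hooks ---------------------------------
    def should_checkpoint(self):
        # Take a checkpoint every 100 tasks; anything executed after the
        # last checkpoint would be replayed on restart.
        return self.tasks_since_checkpoint >= 100

    def save_checkpoint(self):
        with open(self.checkpoint_path, "wb") as f:
            pickle.dump(self.value, f)
        self.tasks_since_checkpoint = 0

    def load_checkpoint(self):
        with open(self.checkpoint_path, "rb") as f:
            self.value = pickle.load(f)
        self.tasks_since_checkpoint = 0
```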

@raulchen
Contributor

I've already implemented this internally, mostly based on this design. I can move the code to GitHub soon. Hope you didn't spend much time on this. Sorry for not informing you beforehand.

raulchen self-assigned this Jan 23, 2019
@stephanie-wang
Contributor Author

I see. @raulchen, can you open a PR ASAP? @ujvl was going to implement this in the backend, but maybe he can review instead.

@raulchen
Contributor

@stephanie-wang Sure, I'll start working on this today. Apologies for the inconvenience.

@stephanie-wang
Contributor Author

No problem, thanks!

By the way, do you also have an implementation for reloading from a checkpoint without replaying the most recent tasks? If yes, it would be great if you could open a PR for that as well. Thank you!

@raulchen
Contributor

reloading from a checkpoint without replaying the most recent tasks

With my current API, this can be done by making should_checkpoint return True for every task, but not actually saving the checkpoint to external storage in save_checkpoint. However, this adds a small overhead of saving the actor's state to the GCS for each task.
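For illustration, a rough sketch of this workaround, under the same assumed hooks as the sketch in the issue description: should_checkpoint reports a checkpoint after every task so a restarted actor has nothing to replay, while save_checkpoint only writes real checkpoint data occasionally. The class, path, and checkpoint interval are hypothetical.

```python
import pickle

import ray


@ray.remote
class ReplayFreeCounter:
    def __init__(self, checkpoint_path="/tmp/replay_free.ckpt"):
        self.checkpoint_path = checkpoint_path
        self.value = 0
        self.calls = 0

    def increment(self):
        self.value += 1
        self.calls += 1
        return self.value

    def should_checkpoint(self):
        # Report a checkpoint after every task, so nothing needs to be
        # replayed when the actor is restarted. The per-task cost noted
        # above is the backend saving the actor's state to the GCS.
        return True

    def save_checkpoint(self):
        # Only write real checkpoint data occasionally; the other calls
        # are no-ops, so there is no external-storage cost per task.
        if self.calls % 100 == 0:
            with open(self.checkpoint_path, "wb") as f:
                pickle.dump(self.value, f)

    def load_checkpoint(self):
        with open(self.checkpoint_path, "rb") as f:
            self.value = pickle.load(f)
```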

@raulchen
Contributor

BTW, I didn't implement checkpoint storage as @ujvl wanted to do in #3825. But I agree that could be useful for some users.

@ujvl
Contributor

ujvl commented Jan 24, 2019

That's fine; we can implement other storage options in another PR.

With my current API, this can be done by making should_checkpoint return True for every task, but don't really save the checkpoint to external storage in save_checkpoint.

Just to clarify, you mean save_checkpoint would still save checkpoint data on some calls, but would save actor state every call, right? I think there's probably a way to do it without that overhead, but it might be better to discuss that after looking at your PR.

@raulchen
Contributor

Just to clarify, you mean save_checkpoint would still save checkpoint data on some calls, but would save actor state every call, right?

@ujvl yeah, that's right.

@stephanie-wang
Contributor Author

I see. I think it would be great to have an API explicitly for reloading an actor from a checkpoint without replay. Also, it would be useful to be able to create a new actor from an old actor's checkpoint.

@raulchen
Contributor

I see. I think it would be great to have an API explicitly for reloading an actor from a checkpoint without replay. Also, it would be useful to be able to create a new actor from an old actor's checkpoint.

Agreed. I haven't thought about how to do that yet, though.

@stephanie-wang
Contributor Author

From a backend perspective, it seems like creating a new actor from an old actor's checkpoint would be pretty easy. I think it could actually be implemented entirely in the Python/Java client.

It's not entirely clear to me how we can natively support reloading an existing actor without replay. If the actor only has one caller, then it seems pretty easy to do, but I'm not so sure about multiple callers. One proposal is that if an actor task gets resubmitted, then we just pretend that the task has been executed (e.g., by treating its dummy object return values as local), so that the next task can run immediately.

Regardless, I think it's best that we start with @raulchen's suggestion of doing a mock checkpoint on every call so that we can implement and test the API, but it would be nice to remove the per-task overhead in the future.
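A hedged sketch of the client-side approach to "create a new actor from an old actor's checkpoint" described above, assuming a checkpoint path reachable by both actors (e.g., shared storage). The Worker class, its checkpoint method, and the path are illustrative names, not an existing Ray API.

```python
import pickle

import ray


@ray.remote
class Worker:
    def __init__(self, checkpoint_path=None):
        # If a checkpoint path is given, start from the old actor's state.
        self.state = {}
        if checkpoint_path is not None:
            with open(checkpoint_path, "rb") as f:
                self.state = pickle.load(f)

    def update(self, key, value):
        self.state[key] = value
        return self.state

    def checkpoint(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.state, f)
        return path


ray.init()

old = Worker.remote()
ray.get(old.update.remote("step", 42))
path = ray.get(old.checkpoint.remote("/tmp/worker.ckpt"))

# Later (even after the old actor has died), construct a brand-new actor
# from the old actor's checkpoint; no backend support is required.
new = Worker.remote(checkpoint_path=path)
```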

ericl added the enhancement label and removed the feature request label on Mar 5, 2020
@stale

stale bot commented Nov 14, 2020

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label on Nov 14, 2020
@stale

stale bot commented Nov 28, 2020

Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale bot closed this as completed on Nov 28, 2020