
Implement actor checkpointing #3839

Merged · 38 commits into ray-project:master on Feb 13, 2019

Conversation

@raulchen (Contributor) commented Jan 24, 2019

What do these changes do?

This PR implements actor checkpointing. To enable checkpointing, users should make their actor classes inherit from the Checkpointable interface. See the Checkpointable definition in actor.py / Checkpointable.java for more details.
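For illustration, a minimal sketch of what a user-defined checkpointable actor might look like on the Python side, based on the callbacks discussed in this thread. The import path, the checkpoint_context attribute names, the shape of the entries in available_checkpoints, and the exact callback signatures are assumptions here; the Checkpointable definition in actor.py is the authoritative reference.

import ray
from ray.actor import Checkpointable  # assumed import path


@ray.remote
class Counter(Checkpointable):
    def __init__(self):
        self.value = 0
        # Illustrative only: real code should persist checkpoint data to
        # external storage, since the actor process itself may die.
        self._saved = {}

    def increment(self):
        self.value += 1
        return self.value

    def should_checkpoint(self, checkpoint_context):
        # Decide whether to checkpoint after the current task, e.g. every
        # 10 tasks since the last checkpoint (attribute name is assumed).
        return checkpoint_context.num_tasks_since_last_checkpoint >= 10

    def save_checkpoint(self, actor_id, checkpoint_id):
        # Persist application state, keyed by the backend checkpoint id.
        self._saved[checkpoint_id] = self.value

    def load_checkpoint(self, actor_id, available_checkpoints):
        # Restore from a checkpoint the backend still knows about and return
        # its id; returning None would mean not resuming from a checkpoint.
        for checkpoint in available_checkpoints:
            if checkpoint.checkpoint_id in self._saved:
                self.value = self._saved[checkpoint.checkpoint_id]
                return checkpoint.checkpoint_id
        return None

    def checkpoint_expired(self, actor_id, checkpoint_id):
        # Called when an old checkpoint is garbage-collected from GCS.
        self._saved.pop(checkpoint_id, None)

The sketch scans available_checkpoints rather than relying on any particular ordering, since the ordering guarantee is left to the actual interface definition.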

Related issue number

#3818

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11111/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11173/
Test FAILed.

@raulchen changed the title from "[WIP] Implement actor checkpointing" to "Implement actor checkpointing" on Jan 26, 2019
@raulchen (Contributor Author)

@stephanie-wang @ujvl This PR is ready for review.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11176/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11178/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11180/
Test FAILed.

@ujvl (Contributor) commented Jan 28, 2019

Thanks! I'll take a pass through this later today.

@stephanie-wang (Contributor) left a comment

Thanks, this looks great so far!

// latest finished task's dummy object in the checkpoint. We may want to consolidate
// these 2 call sites.
ActorRegistration actor_registration = actor_entry->second;
actor_registration.ExtendFrontier(actor_handle_id, dummy_object);
Contributor

It might make sense to separate out the following code, which populates checkpoint_data, into a method of ActorRegistration.

// Mark the unreleased dummy objects as local.
for (const auto &entry : actor_entry->second.GetDummyObjects()) {
HandleObjectLocal(entry.first);
}
Contributor

I'm not entirely sure how to test it, but I think it is possible for tasks from before the checkpoint to be resubmitted and then get stuck in the WAITING queue. For example:

  1. The actor takes a checkpoint after task i.
  2. Task i+1 is submitted to the actor, but the actor dies.
  3. The raylet detects the actor's death and caches task i+1.
  4. The raylet reconstructs the actor and the application reloads it from the checkpoint.
  5. The raylet publishes the actor's new location. Task i+1 gets resubmitted and the raylet listens for the task lease for task i, since task i+1 depends on it.
  6. The raylet looks up the checkpoint data for the resumed checkpoint ID.
  7. The task lease for task i expires, and task i gets resubmitted.
  8. The raylet receives the checkpoint data and restores the frontier. Task i is now behind the frontier, and its dependencies will never appear, so it will remain in the WAITING queue forever.

One way to fix this is to add code here that iterates through the task queues and removes any tasks that occur before the checkpoint frontier. Another way is to wait until we receive the checkpoint frontier before calling HandleActorStateTransition and PublishActorStateTransition to resubmit any cached actor tasks, which I think is a little nicer.

Contributor Author

Good catch, this is indeed a problem. I think I'll probably fix it by moving the "restore-from-checkpoint" part to HandleActorStateTransition. Because looking up a checkpoint is an async operation, we need to wake up the actor tasks in the callback of the lookup.

<< " for actor " << actor_id << " in GCS. This is likely"
<< " because the worker sent us a wrong or expired"
<< " checkpoint id.";
// TODO(hchen): what should we do here? Notify or kill the actor?
Contributor

It probably makes sense to kill the actor, since it's pretty unclear what kind of semantics we could guarantee otherwise.

Contributor Author

I just realized that I've already checked whether the returned checkpoint id is valid at the front end. I can simply do a RAY_LOG(FATAL) here.

python/ray/includes/unique_ids.pxi (resolved review thread)
return
actor_id = self.actor_id
actor = self.actors[actor_id]
# An actor that needs checkpointing must inherent the `Checkpointable`
Contributor

Suggested change
# An actor that needs checkpointing must inherent the `Checkpointable`
# An actor that needs checkpointing must inherit from the `Checkpointable`

@stephanie-wang (Contributor)

@ericl, can you take a look at the Python API?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11263/
Test FAILed.

@ujvl (Contributor) left a comment

Couple minor comments, otherwise looks good!

src/ray/gcs/format/gcs.fbs (resolved review thread)
self.raylet_client.notify_actor_resumed_from_checkpoint(
actor_id, checkpoint_id)
elif is_actor_task:
self._num_tasks_since_last_checkpoint += 1
Contributor

Would multiple actors be updating the same _num_tasks_since_last_checkpoint and _last_checkpoint_timestamp? Each actor should have its own, right?

Contributor Author

No, because one worker can have at most one actor at a time.

Contributor

It might still be cleaner to move those variables to the actor in case that assumption changes in the future, unless there's a good reason for keeping them here.

// empty lineage this time.
SubmitTask(method, Lineage());
// The actor's location is now known.
bool resumed_from_checkpoint = checkpoint_id_to_restore_.count(actor_id) > 0;
Contributor

nit, slight preference for find over count for clarity with unordered_map

Contributor Author

I use count because it's shorter and can fit in one line. Do you know any pros of using find over count? I think in terms of efficiency, they should be the same.

const auto &checkpoint_id = UniqueID::from_binary(*copy->checkpoint_ids.begin());
RAY_LOG(DEBUG) << "Deleting checkpoint " << checkpoint_id << " for actor " << actor_id;
copy->timestamps.erase(copy->timestamps.begin());
copy->checkpoint_ids.erase(copy->checkpoint_ids.begin());
Contributor

Since checkpoint_ids is one long string of concatenated IDs:

checkpoint_ids: [string];

we should not erase begin(); begin() is just the first character of the string.

Contributor Author

Currently, checkpoint_ids is defined as a list of strings, so using begin() is fine. But I should concatenate them into one single string. Thanks for the reminder.

auto num_to_keep = RayConfig::instance().num_actor_checkpoints_to_keep();
while (copy->timestamps.size() > num_to_keep) {
// Delete the checkpoint from actor checkpoint table.
const auto &checkpoint_id = UniqueID::from_binary(*copy->checkpoint_ids.begin());
Contributor

Also here.

// ID of this actor.
actor_id: string;
// A list of the available checkpoint IDs for this actor.
checkpoint_ids: [string];
@jovany-wang (Contributor) Jan 29, 2019

Since we often insert IDs into checkpoint_ids and remove elements from it, would it be better to define an Objects table like this?

table Objects {
    objects: [Object];
}
table Object {
    object: [byte];
}

} else {
// If this actor was resumed from a checkpoint, look up the checkpoint in GCS,
// restore actor state, and resubmit the waiting tasks.
const auto checkpoint_id = checkpoint_id_to_restore_[actor_id];
Contributor

Hmm I'm not sure if this will work in a distributed setting, since the publish could still go out to other nodes, which will resubmit their cached tasks, and that could potentially happen before we restore the checkpoint here.

Contributor Author

Ah, yes, that problem could still happen.

@stephanie-wang (Contributor)

One question about the API: the backend only keeps around the last n checkpoints for an actor, but it seems like we should notify the application when older checkpoints get garbage-collected in the GCS, right? Have you thought about how we should do that?

@raulchen (Contributor Author)

One question about the API: the backend only keeps around the last n checkpoints for an actor, but it seems like we should notify the application when older checkpoints get garbage-collected in the GCS, right? Have you thought about how we should do that?

We pass in the available_checkpoints parameter to load_checkpoint, so users can know which checkpoints are still available. This should be okay?

@stephanie-wang (Contributor)

One question about the API: the backend only keeps around the last n checkpoints for an actor, but it seems like we should notify the application when older checkpoints get garbage-collected in the GCS, right? Have you thought about how we should do that?

We pass in the available_checkpoints parameter to load_checkpoint, so users can know which checkpoints are still available. This should be okay?

I meant that we should probably let the user know when checkpoints get GC'ed in the GCS, so that they can GC the application checkpoint data.

@raulchen (Contributor Author)

@stephanie-wang I see.
I can make the num_checkpoints_to_keep config available to user code. When they save a new app-level checkpoint, they can clean up the old ones. Is this okay?

@ujvl (Contributor) commented Jan 30, 2019

When they save a new app-level checkpoint, they can clean up the old ones. Is this okay?

You could add another callback function to the Checkpointable interface, like cleanup_checkpoint(checkpoint_id), that's implemented by the user if they want to do GC; we can call it in _handle_actor_checkpoint after save_checkpoint. That way they don't need to explicitly assume which checkpoint was garbage-collected when they take a new checkpoint.

@stephanie-wang (Contributor)

When they save a new app-level checkpoint, they can clean up the old ones. Is this okay?

You could add another callback function to the Checkpointable interface, like cleanup_checkpoint(checkpoint_id), that's implemented by the user if they want to do GC; we can call it in _handle_actor_checkpoint after save_checkpoint. That way they don't need to explicitly assume which checkpoint was garbage-collected when they take a new checkpoint.

+1

Actually, we should probably advise the user about how to write checkpoint data. For instance, if they write the checkpoint data in place (like you currently have in the Python test), it could break if num_actor_checkpoints_to_keep=1 since the application checkpoint is not atomic with the backend checkpoint. Not to say that we should enforce that in this PR, but it'd be good to add this to the online documentation.

@@ -135,3 +135,6 @@ RAY_CONFIG(int, num_workers_per_process, 1);

/// Maximum timeout in milliseconds within which a task lease must be renewed.
RAY_CONFIG(int64_t, max_task_lease_timeout_ms, 60000);

/// Maximum number of checkpoints to keep in GCS for an actor.
RAY_CONFIG(uint32_t, num_actor_checkpoints_to_keep, 200);
Contributor

How about defaulting this to 2? If the application is writing its checkpoint data in place, then that would guarantee that the stored checkpoint is always in the backend's available_checkpoints. If the application isn't writing its checkpoint data in place and therefore needs to GC old checkpoints, then this would minimize the amount of checkpoint data per actor.

"likely due to reconstruction.";
}
SubmitTask(task, Lineage());
}
Contributor Author

@stephanie-wang I ended up fixing the issue by resubmitting the waiting task here.

Contributor

Hmm okay, I think I prefer the other solution I mentioned because it seems cleaner, but maybe I'm wrong. I'll give it a shot and push to the PR if it seems doable.

pass

@abstractmethod
def checkpoint_expired(self, checkpoint_id):
Contributor Author

@ujvl @stephanie-wang I added a checkpoint_expired callback here.

Contributor

I think you need to pass in the actor_id here as well if it's used to locate the data (since save_checkpoint may use it in that way).

Contributor Author

yep, makes sense.
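Pulling these points together (the actor_id argument, the GC callback, and the earlier advice not to write checkpoint data in place), here is a hedged sketch of how a user might implement these two callbacks with one file per checkpoint id. The directory, file naming, the hex() calls, and the get_state() helper are all hypothetical, not part of this PR.

import os
import pickle

CHECKPOINT_DIR = "/tmp/my_actor_checkpoints"  # hypothetical location


class FileCheckpointMixin(object):
    # Illustrative helpers for a Checkpointable actor; not part of this PR.

    def _path(self, actor_id, checkpoint_id):
        # One file per checkpoint id, rather than a single file overwritten
        # in place, so an expired checkpoint never clobbers a live one.
        # (Assumes the ID objects expose a hex() method.)
        name = "{}_{}.pkl".format(actor_id.hex(), checkpoint_id.hex())
        return os.path.join(CHECKPOINT_DIR, name)

    def save_checkpoint(self, actor_id, checkpoint_id):
        os.makedirs(CHECKPOINT_DIR, exist_ok=True)
        with open(self._path(actor_id, checkpoint_id), "wb") as f:
            pickle.dump(self.get_state(), f)  # get_state() is hypothetical

    def checkpoint_expired(self, actor_id, checkpoint_id):
        # The backend garbage-collected this checkpoint from GCS; mirror
        # that by deleting the corresponding application data.
        path = self._path(actor_id, checkpoint_id)
        if os.path.exists(path):
            os.remove(path)

Under these assumptions, data for any id still listed in available_checkpoints should remain on disk, since it is only deleted from checkpoint_expired.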

/// Maximum number of checkpoints to keep in GCS for an actor.
/// Note: this number should be set to at least 2, because saving an application
/// checkpoint isn't atomic with saving the backend checkpoint; it will break
/// if this number is set to 1 and users save application checkpoints in place.
@raulchen (Contributor Author) Feb 1, 2019

@stephanie-wang I added a note here to warn about setting num_actor_checkpoints_to_keep=1.
Also decreased the default value to 20 to reduce overhead. I think 2 might be too small for some users.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11393/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11394/
Test FAILed.

@raulchen (Contributor Author) commented Feb 1, 2019

@jovany-wang The Java part is done, please take a look. Thanks!

@raulchen (Contributor Author) commented Feb 8, 2019

@pschafhalter Thanks! The comments are addressed, or replied to where I had questions. Could you take a look again? Thank you.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11679/
Test FAILed.

@raulchen (Contributor Author) commented Feb 8, 2019

@stephanie-wang I think we can now deprecate the __ray_checkpoint__ magic method. We can achieve the same functionality in the following way:

def checkpoint(self):
    self._should_checkpoint = True

def should_checkpoint(self, checkpoint_context):
    return self._should_checkpoint

actor.checkpoint.remote()

Also, by deprecating this, we can remove a bunch of condition checks and unneeded code.
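For completeness, a hedged, self-contained version of the pattern sketched above. Resetting the flag inside should_checkpoint is an assumption (the snippet above leaves it open), so that each call to actor.checkpoint.remote() triggers exactly one checkpoint; the import path is also assumed.

import ray
from ray.actor import Checkpointable  # assumed import path


@ray.remote
class OnDemandCheckpointActor(Checkpointable):
    def __init__(self):
        self._should_checkpoint = False

    def checkpoint(self):
        # Request a checkpoint after this task finishes.
        self._should_checkpoint = True

    def should_checkpoint(self, checkpoint_context):
        # Return the flag and clear it, so only the explicitly requested
        # task triggers a checkpoint.
        should = self._should_checkpoint
        self._should_checkpoint = False
        return should

    def save_checkpoint(self, actor_id, checkpoint_id):
        pass  # persist application state here

    def load_checkpoint(self, actor_id, available_checkpoints):
        return None  # returning None means not resuming from a checkpoint

    def checkpoint_expired(self, actor_id, checkpoint_id):
        pass


ray.init()
actor = OnDemandCheckpointActor.remote()
actor.checkpoint.remote()  # replaces the old __ray_checkpoint__ call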

Another small comment: maybe we should move _save_actor_checkpoint and _restore_actor_checkpoint to function_manager.py, because it's weird that they are defined as private methods of the Worker class but never used in Worker.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11686/
Test FAILed.

@stephanie-wang (Contributor)

Sounds good, I can try that now (and fix the conflict too).

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11802/
Test FAILed.

@raulchen (Contributor Author)

Sounds good, I can try that now (and fix the conflict too).

thanks! I think this PR is ready for merge if CI passes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11824/
Test FAILed.

@stephanie-wang (Contributor) left a comment

Awesome job!

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11825/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11840/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11854/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11856/
Test FAILed.

@raulchen merged commit f31a79f into ray-project:master on Feb 13, 2019
@raulchen deleted the actor_checkpoint branch on February 13, 2019 at 11:39

6 participants