
Implement actor checkpointing #3839

Merged · 38 commits into ray-project:master on Feb 13, 2019

Conversation

@raulchen (Contributor) commented Jan 24, 2019

What do these changes do?

This PR implements actor checkpointing. To enable checkpointing, users should make their actor classes inherit from the Checkpointable interface. See the Checkpointable definition in actor.py / Checkpointable.java for more details.
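For illustration, a minimal sketch of what a user-defined checkpointable actor might look like on the Python side, based on the callbacks discussed in this thread. The import path, the checkpoint_context attribute names, the shape of the entries in available_checkpoints, and the exact callback signatures are assumptions here; the Checkpointable definition in actor.py is the authoritative reference.

import ray
from ray.actor import Checkpointable  # assumed import path


@ray.remote
class Counter(Checkpointable):
    def __init__(self):
        self.value = 0
        # Illustrative only: real code should persist checkpoint data to
        # external storage, since the actor process itself may die.
        self._saved = {}

    def increment(self):
        self.value += 1
        return self.value

    def should_checkpoint(self, checkpoint_context):
        # Decide whether to checkpoint after the current task, e.g. every
        # 10 tasks since the last checkpoint (attribute name is assumed).
        return checkpoint_context.num_tasks_since_last_checkpoint >= 10

    def save_checkpoint(self, actor_id, checkpoint_id):
        # Persist application state, keyed by the backend checkpoint id.
        self._saved[checkpoint_id] = self.value

    def load_checkpoint(self, actor_id, available_checkpoints):
        # Restore from a checkpoint the backend still knows about and return
        # its id; returning None would mean not resuming from a checkpoint.
        for checkpoint in available_checkpoints:
            if checkpoint.checkpoint_id in self._saved:
                self.value = self._saved[checkpoint.checkpoint_id]
                return checkpoint.checkpoint_id
        return None

    def checkpoint_expired(self, actor_id, checkpoint_id):
        # Called when an old checkpoint is garbage-collected from GCS.
        self._saved.pop(checkpoint_id, None)

The sketch scans available_checkpoints rather than relying on any particular ordering, since the ordering guarantee is left to the actual interface definition.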

Related issue number

#3818

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11111/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11173/
Test FAILed.

@raulchen changed the title from "[WIP] Implement actor checkpointing" to "Implement actor checkpointing" on Jan 26, 2019
@raulchen (Contributor Author)

@stephanie-wang @ujvl This PR is ready for review.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11176/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11178/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11180/
Test FAILed.

@ujvl (Contributor) commented Jan 28, 2019

Thanks! I'll take a pass through this later today.

@stephanie-wang (Contributor) left a comment

Thanks, this looks great so far!

// latest finished task's dummy object in the checkpoint. We may want to consolidate
// these 2 call sites.
ActorRegistration actor_registration = actor_entry->second;
actor_registration.ExtendFrontier(actor_handle_id, dummy_object);
Contributor

It might make sense to separate out the following code, which populates checkpoint_data, into a method of ActorRegistration.

// Mark the unreleased dummy objects as local.
for (const auto &entry : actor_entry->second.GetDummyObjects()) {
HandleObjectLocal(entry.first);
}
Contributor

I'm not entirely sure how to test it, but I think it is possible for tasks from before the checkpoint to be resubmitted and then get stuck in the WAITING queue. For example:

  1. The actor takes a checkpoint after task i.
  2. Task i+1 is submitted to the actor, but the actor dies.
  3. The raylet detects the actor's death and caches task i+1.
  4. The raylet reconstructs the actor and the application reloads it from the checkpoint.
  5. The raylet publishes the actor's new location. Task i+1 gets resubmitted and the raylet listens for the task lease for task i, since task i+1 depends on it.
  6. The raylet looks up the checkpoint data for the resumed checkpoint ID.
  7. The task lease for task i expires, and task i gets resubmitted.
  8. The raylet receives the checkpoint data and restores the frontier. Task i is now behind the frontier, and its dependencies will never appear, so it will remain in the WAITING queue forever.

One way to fix this is to add code here that iterates through the task queues and removes any tasks that occur before the checkpoint frontier. Another way is to wait until we receive the checkpoint frontier before calling HandleActorStateTransition and PublishActorStateTransition to resubmit any cached actor tasks, which I think is a little nicer.

Contributor Author

Good catch, this is indeed a problem. I think I'll probably fix it by moving the "restore-from-checkpoint" part to HandleActorStateTransition. Because looking up a checkpoint is an async operation, we need to wake up the actor tasks in the callback of the lookup.

<< " for actor " << actor_id << " in GCS. This is likely"
<< " because the worker sent us a wrong or expired"
<< " checkpoint id.";
// TODO(hchen): what should we do here? Notify or kill the actor?
Contributor

It probably makes sense to kill the actor, since it's pretty unclear what kind of semantics we could guarantee otherwise.

Contributor Author

I just realized that I've already checked whether the returned checkpoint id is valid at the front end. I can simply do a RAY_LOG(FATAL) here.

python/ray/includes/unique_ids.pxi (resolved review thread)
return
actor_id = self.actor_id
actor = self.actors[actor_id]
# An actor that needs checkpointing must inherent the `Checkpointable`
Contributor

Suggested change
# An actor that needs checkpointing must inherent the `Checkpointable`
# An actor that needs checkpointing must inherit from the `Checkpointable`

@stephanie-wang (Contributor)

@ericl, can you take a look at the Python API?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11263/
Test FAILed.

@ujvl (Contributor) left a comment

Couple minor comments, otherwise looks good!

src/ray/gcs/format/gcs.fbs (resolved review thread)
self.raylet_client.notify_actor_resumed_from_checkpoint(
actor_id, checkpoint_id)
elif is_actor_task:
self._num_tasks_since_last_checkpoint += 1
Contributor

Would multiple actors be updating the same _num_tasks_since_last_checkpoint and _last_checkpoint_timestamp? Each actor should have its own, right?

Contributor Author

No, because one worker can have at most one actor at a time.

Contributor

It might still be cleaner to move those variables to the actor in case that assumption changes in the future, unless there's a good reason for keeping them here.

// empty lineage this time.
SubmitTask(method, Lineage());
// The actor's location is now known.
bool resumed_from_checkpoint = checkpoint_id_to_restore_.count(actor_id) > 0;
Contributor

nit, slight preference for find over count for clarity with unordered_map

Contributor Author

I use count because it's shorter and can fit in one line. Do you know any pros of using find over count? I think in terms of efficiency, they should be the same.

const auto &checkpoint_id = UniqueID::from_binary(*copy->checkpoint_ids.begin());
RAY_LOG(DEBUG) << "Deleting checkpoint " << checkpoint_id << " for actor " << actor_id;
copy->timestamps.erase(copy->timestamps.begin());
copy->checkpoint_ids.erase(copy->checkpoint_ids.begin());
Contributor

Since checkpoint_ids is one long string of concatenated IDs:

checkpoint_ids: [string];

we should not erase begin(); begin() is just the first character of the string.

Contributor Author

Currently, checkpoint_ids is defined as a list of strings, so using begin() is fine. But I should concatenate them into one single string. Thanks for the reminder.

auto num_to_keep = RayConfig::instance().num_actor_checkpoints_to_keep();
while (copy->timestamps.size() > num_to_keep) {
// Delete the checkpoint from actor checkpoint table.
const auto &checkpoint_id = UniqueID::from_binary(*copy->checkpoint_ids.begin());
Contributor

Also here.

// ID of this actor.
actor_id: string;
// A list of the available checkpoint IDs for this actor.
checkpoint_ids: [string];
@jovany-wang (Contributor) Jan 29, 2019

Since we often insert IDs into checkpoint_ids and remove elements from it, would it be better to define an Objects table like this?

table Objects {
    objects: [Object];
}
table Object {
    object: [byte];
}

} else {
// If this actor was resumed from a checkpoint, look up the checkpoint in GCS,
// restore actor state, and resubmit the waiting tasks.
const auto checkpoint_id = checkpoint_id_to_restore_[actor_id];
Contributor

Hmm I'm not sure if this will work in a distributed setting, since the publish could still go out to other nodes, which will resubmit their cached tasks, and that could potentially happen before we restore the checkpoint here.

Contributor Author

Ah, yes, that problem could still happen.

@stephanie-wang (Contributor)

One question about the API: the backend only keeps around the last n checkpoints for an actor, but it seems like we should notify the application when older checkpoints get garbage-collected in the GCS, right? Have you thought about how we should do that?

@raulchen (Contributor Author)

One question about the API: the backend only keeps around the last n checkpoints for an actor, but it seems like we should notify the application when older checkpoints get garbage-collected in the GCS, right? Have you thought about how we should do that?

We pass in the available_checkpoints parameter to load_checkpoint, so users can know which checkpoints are still available. This should be okay?

@stephanie-wang (Contributor)

One question about the API: the backend only keeps around the last n checkpoints for an actor, but it seems like we should notify the application when older checkpoints get garbage-collected in the GCS, right? Have you thought about how we should do that?

We pass in the available_checkpoints parameter to load_checkpoint, so users can know which checkpoints are still available. This should be okay?

I meant that we should probably let the user know when checkpoints get GC'ed in the GCS, so that they can GC the application checkpoint data.

@raulchen (Contributor Author)

@stephanie-wang I see.
I can make the num_checkpoints_to_keep config available to user code. When they save a new app-level checkpoint, they can clean up the old ones. Is this okay?

@ujvl (Contributor) commented Jan 30, 2019

When they save a new app-level checkpoint, they can clean up the old ones. Is this okay?

You could add another callback function to the Checkpointable interface, like cleanup_checkpoint(checkpoint_id), that's implemented by the user if they want to do GC; we can call it in _handle_actor_checkpoint after save_checkpoint. That way they don't need to explicitly assume which checkpoint was garbage-collected when they take a new checkpoint.

@stephanie-wang (Contributor)

When they save a new app-level checkpoint, they can clean up the old ones. Is this okay?

You could add another callback function to the Checkpointable interface, like cleanup_checkpoint(checkpoint_id), that's implemented by the user if they want to do GC; we can call it in _handle_actor_checkpoint after save_checkpoint. That way they don't need to explicitly assume which checkpoint was garbage-collected when they take a new checkpoint.

+1

Actually, we should probably advise the user about how to write checkpoint data. For instance, if they write the checkpoint data in place (like you currently have in the Python test), it could break if num_actor_checkpoints_to_keep=1 since the application checkpoint is not atomic with the backend checkpoint. Not to say that we should enforce that in this PR, but it'd be good to add this to the online documentation.

@@ -135,3 +135,6 @@ RAY_CONFIG(int, num_workers_per_process, 1);

/// Maximum timeout in milliseconds within which a task lease must be renewed.
RAY_CONFIG(int64_t, max_task_lease_timeout_ms, 60000);

/// Maximum number of checkpoints to keep in GCS for an actor.
RAY_CONFIG(uint32_t, num_actor_checkpoints_to_keep, 200);
Contributor

How about defaulting this to 2? If the application is writing its checkpoint data in place, then that would guarantee that the stored checkpoint is always in the backend's available_checkpoints. If the application isn't writing its checkpoint data in place and therefore needs to GC old checkpoints, then this would minimize the amount of checkpoint data per actor.

"likely due to reconstruction.";
}
SubmitTask(task, Lineage());
}
Contributor Author

@stephanie-wang I ended up fixing the issue by resubmitting the waiting task here.

Contributor

Hmm okay, I think I prefer the other solution I mentioned because it seems cleaner, but maybe I'm wrong. I'll give it a shot and push to the PR if it seems doable.

pass

@abstractmethod
def checkpoint_expired(self, checkpoint_id):
Contributor Author

@ujvl @stephanie-wang I added a checkpoint_expired callback here.

Contributor

I think you need to pass in the actor_id here as well if it's used to locate the data (since save_checkpoint may use it in that way).

Contributor Author

yep, makes sense.
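Pulling these points together (the actor_id argument, the GC callback, and the earlier advice not to write checkpoint data in place), here is a hedged sketch of how a user might implement these two callbacks with one file per checkpoint id. The directory, file naming, the hex() calls, and the get_state() helper are all hypothetical, not part of this PR.

import os
import pickle

CHECKPOINT_DIR = "/tmp/my_actor_checkpoints"  # hypothetical location


class FileCheckpointMixin(object):
    # Illustrative helpers for a Checkpointable actor; not part of this PR.

    def _path(self, actor_id, checkpoint_id):
        # One file per checkpoint id, rather than a single file overwritten
        # in place, so an expired checkpoint never clobbers a live one.
        # (Assumes the ID objects expose a hex() method.)
        name = "{}_{}.pkl".format(actor_id.hex(), checkpoint_id.hex())
        return os.path.join(CHECKPOINT_DIR, name)

    def save_checkpoint(self, actor_id, checkpoint_id):
        os.makedirs(CHECKPOINT_DIR, exist_ok=True)
        with open(self._path(actor_id, checkpoint_id), "wb") as f:
            pickle.dump(self.get_state(), f)  # get_state() is hypothetical

    def checkpoint_expired(self, actor_id, checkpoint_id):
        # The backend garbage-collected this checkpoint from GCS; mirror
        # that by deleting the corresponding application data.
        path = self._path(actor_id, checkpoint_id)
        if os.path.exists(path):
            os.remove(path)

Under these assumptions, data for any id still listed in available_checkpoints should remain on disk, since it is only deleted from checkpoint_expired.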

/// Maximum number of checkpoints to keep in GCS for an actor.
/// Note: this number should be set to at least 2, because saving an application
/// checkpoint isn't atomic with saving the backend checkpoint; it will break
/// if this number is set to 1 and users save application checkpoints in place.
@raulchen (Contributor Author) Feb 1, 2019

@stephanie-wang I added a note here to warn about setting num_actor_checkpoints_to_keep=1.
Also decreased the default value to 20 to reduce overhead. I think 2 might be too small for some users.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11393/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11394/
Test FAILed.

@raulchen (Contributor Author) commented Feb 1, 2019

@jovany-wang The Java part is done, please take a look. Thanks!

@raulchen (Contributor Author) commented Feb 8, 2019

@pschafhalter Thanks! The comments are addressed, or replied to where I had questions. Could you take a look again? Thank you.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11679/
Test FAILed.

@raulchen (Contributor Author) commented Feb 8, 2019

@stephanie-wang I think we can now deprecate the __ray_checkpoint__ magic method. We can achieve the same functionality in the following way:

def checkpoint(self):
    self._should_checkpoint = True

def should_checkpoint(self, checkpoint_context):
    return self._should_checkpoint

actor.checkpoint.remote()

Also, by deprecating this, we can remove a bunch of condition checks and unneeded code.
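For completeness, a hedged, self-contained version of the pattern sketched above. Resetting the flag inside should_checkpoint is an assumption (the snippet above leaves it open), so that each call to actor.checkpoint.remote() triggers exactly one checkpoint; the import path is also assumed.

import ray
from ray.actor import Checkpointable  # assumed import path


@ray.remote
class OnDemandCheckpointActor(Checkpointable):
    def __init__(self):
        self._should_checkpoint = False

    def checkpoint(self):
        # Request a checkpoint after this task finishes.
        self._should_checkpoint = True

    def should_checkpoint(self, checkpoint_context):
        # Return the flag and clear it, so only the explicitly requested
        # task triggers a checkpoint.
        should = self._should_checkpoint
        self._should_checkpoint = False
        return should

    def save_checkpoint(self, actor_id, checkpoint_id):
        pass  # persist application state here

    def load_checkpoint(self, actor_id, available_checkpoints):
        return None  # returning None means not resuming from a checkpoint

    def checkpoint_expired(self, actor_id, checkpoint_id):
        pass


ray.init()
actor = OnDemandCheckpointActor.remote()
actor.checkpoint.remote()  # replaces the old __ray_checkpoint__ call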

Another small comment: maybe we should move _save_actor_checkpoint and _restore_actor_checkpoint to function_manager.py, because it's weird that they are defined as private methods of the Worker class but never used in Worker.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11686/
Test FAILed.

@stephanie-wang (Contributor)

Sounds good, I can try that now (and fix the conflict too).

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11802/
Test FAILed.

@raulchen (Contributor Author)

Sounds good, I can try that now (and fix the conflict too).

thanks! I think this PR is ready for merge if CI passes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11824/
Test FAILed.

@stephanie-wang (Contributor) left a comment

Awesome job!

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11825/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11840/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11854/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11856/
Test FAILed.

@raulchen merged commit f31a79f into ray-project:master on Feb 13, 2019
@raulchen deleted the actor_checkpoint branch on February 13, 2019 at 11:39

6 participants