[Tune] Revisiting checkpointing policy #4287
Comments
I've had similar issues, especially when running off-policy algorithms like SAC from vision. Because restoring a trial requires saving the full replay pool (or at least most of it), the checkpoint sizes grow very large. I've alleviated the issue a little by checkpointing only the part of the replay pool that has not previously been saved, but this doesn't solve the whole issue. Related to this: it would be nice to be able to specify filters for the ...
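For illustration, a rough sketch of the incremental replay-pool checkpointing described above; the class, the list-like buffer layout, and the file naming are assumptions made for the sketch, not code from SAC or any RL library:

```python
# Rough sketch of incremental replay-pool checkpointing: each checkpoint
# writes only the transitions added since the previous one. The buffer
# layout and file naming are assumptions, not code from any RL library.
import os
import pickle


class IncrementalReplayCheckpointer:
    def __init__(self):
        self.num_saved = 0  # how many transitions have already been persisted

    def save(self, replay_pool, checkpoint_dir):
        # Persist only the slice of the pool added since the last save.
        new_transitions = replay_pool[self.num_saved:]
        path = os.path.join(checkpoint_dir, f"replay_chunk_{self.num_saved}.pkl")
        with open(path, "wb") as f:
            pickle.dump(new_transitions, f)
        self.num_saved = len(replay_pool)

    def restore(self, checkpoint_dirs):
        # Rebuild the pool by concatenating the chunks written by earlier
        # checkpoints, iterating over the checkpoint dirs in order.
        pool = []
        for d in checkpoint_dirs:
            for name in os.listdir(d):
                if name.startswith("replay_chunk_"):
                    with open(os.path.join(d, name), "rb") as f:
                        pool.extend(pickle.load(f))
        return pool
```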
Thanks for the discussion! I agree that trial disk-space usage can be a huge issue, but it's not totally clear what API to expose here. One option is to replace the current checkpointing flags with something like checkpoint_args=dict(save_best=True, freq=5, monitor_attr="loss", save_most_recent=True, max_to_keep=N). Another option is to expose hooks/callbacks (https://keras.io/callbacks/#modelcheckpoint) instead of providing a million flags. This would be something like a CheckpointPolicy callback that decides whether or not to checkpoint the trial. However, this might be overkill; let me know what you all think. Maybe we can open a new issue for the ...
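For illustration, a rough sketch of what the callback-style option could look like; CheckpointPolicy and every name in it are hypothetical, not an existing Tune API:

```python
# Hypothetical sketch of the callback-style option discussed above. None of
# these names exist in Tune; they only illustrate the idea of a pluggable
# policy object that decides when a trial should be checkpointed.

class CheckpointPolicy:
    """Decides, per training result, whether a trial should be checkpointed."""

    def should_checkpoint(self, trial_id, result):
        raise NotImplementedError


class BestAndPeriodic(CheckpointPolicy):
    """Checkpoint every `freq` iterations and whenever `monitor_attr` improves."""

    def __init__(self, monitor_attr="loss", freq=5):
        self.monitor_attr = monitor_attr
        self.freq = freq
        self.best = float("inf")

    def should_checkpoint(self, trial_id, result):
        improved = result[self.monitor_attr] < self.best
        if improved:
            self.best = result[self.monitor_attr]
        periodic = result["training_iteration"] % self.freq == 0
        return improved or periodic
```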
Thanks for the response! Callbacks do sound like the most general approach, but I too feel like it might be overkill. I like the idea of the checkpoint_args, but I wonder if we could achieve most of the desired flexibility with fewer flags. How about letting the user decide when they want to checkpoint? Say at the end of a step, alongside the "done" flag, one could return something like the result sketched below:
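(A minimal illustration of what such a result might contain; the should_checkpoint key and the metric name are invented for the sketch, not part of Tune's result schema in this discussion.)

```python
# Hypothetical result returned at the end of a training step; the
# "should_checkpoint" key is an invented name used only to illustrate a
# per-step checkpoint request from the Trainable.
result = {
    "done": False,
    "mean_loss": 0.42,
    "should_checkpoint": True,  # ask Tune to checkpoint this trial now
}
```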
Then the user can customize their own checkpoint policy directly in the Trainable object. It's easy for the user to keep track of the best and last model, so they would be able to do anything really. I suppose that's quite similar to the checkpoint policy solution you're suggesting, actually. I need to think about this further... :) On that note, I also wonder if the report dictionary should be decoupled from logging (sort of goes back to our conversation in #4157).
Hm, users can already do that, right? In a Trainable, you can call ... Let me know when you come up with something :)
There's an API we're exploring that deals with this use case. The idea would be to have a separate logging mechanism that can be used freely, separate from the Tune runtime. We'll open an issue on this soon (and will notify you if you're interested).
You're absolutely right, actually. I'll manipulate ...
BTW @jeremyasapp, #5064 actually ends up implementing what you originally suggested, with Trainables controlling the checkpoint policy directly. This is nice because it allows Tune to manage the checkpoint and recover the trial effectively. In addition, #4490 automatically deletes checkpoints, though it may be a bit inefficient :) Please check those out!
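For reference, a hedged sketch of how the retention side can be configured via tune.run; checkpoint_freq, checkpoint_at_end, keep_checkpoints_num, and checkpoint_score_attr are arguments from Tune's run API of roughly that era, but the exact mapping to #4490/#5064 is my assumption, and the stub Trainable below is only there to make the call concrete:

```python
# Hedged sketch: retention-related arguments to tune.run from roughly this
# era of Tune. The stub Trainable is a placeholder so the call is runnable;
# whether these knobs correspond exactly to #4490/#5064 is an assumption.
import os

from ray import tune
from ray.tune import Trainable


class MyTrainable(Trainable):
    """Minimal stub Trainable so the tune.run call below is concrete."""

    def _setup(self, config):
        self._step = 0

    def _train(self):
        self._step += 1
        return {"loss": 1.0 / self._step}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint.txt")
        with open(path, "w") as f:
            f.write(str(self._step))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self._step = int(f.read())


tune.run(
    MyTrainable,
    stop={"training_iteration": 20},
    checkpoint_freq=5,                 # checkpoint every 5 training iterations
    checkpoint_at_end=True,            # also checkpoint when the trial finishes
    keep_checkpoints_num=2,            # keep only the 2 best-scoring checkpoints per trial
    checkpoint_score_attr="min-loss",  # "min-" prefix: lower loss scores better
)
```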
Describe the problem
I was wondering if it'd be possible to get more information about the design decision to save every checkpoint during training? When running large grid searches where each trial consists of many steps, disk-space usage can blow up very quickly (I went beyond 100 GB very fast), which may lead to out-of-disk-space errors, for example when using AWS instances.
Instead, why not overwrite the checkpoints every time? If the goal is persistence, then overwriting the previous checkpoint would be reasonable. If the checkpoints are there to keep track of the best model (which I don't believe they are), then another, more efficient strategy would be to always keep two checkpoints: the best checkpoint and the last checkpoint (assuming "best" is defined).
This is actually what I do myself: I keep track of the best model directly in the Trainable and always checkpoint it alongside the last model. So in theory, I could consistently overwrite the checkpoint and make disk-space usage a function of the number of trials only, as opposed to also making it a function of the number of steps.
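For illustration, a minimal sketch of this "best + last" pattern, written against the older class-based Trainable API (_setup/_train/_save/_restore); the model state, the "loss" metric, and the training helper are placeholders:

```python
# Sketch of the "best + last" pattern described above, using the older
# class-based Trainable API. The model state, the "loss" metric, and
# _run_one_epoch are placeholders for real training code.
import os
import pickle

from ray.tune import Trainable


class BestAndLastTrainable(Trainable):
    def _setup(self, config):
        self.model_state = {"weights": None}  # placeholder for real model state
        self.best_loss = float("inf")
        self.best_state = None

    def _train(self):
        loss = self._run_one_epoch()
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_state = dict(self.model_state)
        return {"loss": loss}

    def _save(self, checkpoint_dir):
        # Always write the last and the best state to a fixed file name, so a
        # trial that overwrites its checkpoint uses O(1) disk, not O(steps).
        path = os.path.join(checkpoint_dir, "checkpoint.pkl")
        with open(path, "wb") as f:
            pickle.dump(
                {"last": self.model_state,
                 "best": self.best_state,
                 "best_loss": self.best_loss},
                f,
            )
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
        self.model_state = state["last"]
        self.best_state = state["best"]
        self.best_loss = state["best_loss"]

    def _run_one_epoch(self):
        # Placeholder: real code would train the model and return its loss.
        return 1.0
```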
Please let me know if I'm missing something! And thank you in advance for the info.