[Tune] Revisiting checkpointing policy #4287

Closed
jeremyasapp opened this issue Mar 6, 2019 · 6 comments

@jeremyasapp

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • Ray installed from (source or binary): pip installed
  • Ray version: 0.6.1
  • Python version: 3.6.5
  • Exact command to reproduce: N/A

Describe the problem

I was wondering if it'd be possible to get more information about the design decision to save every checkpoint during training. When running large grid searches where each trial consists of many steps, disk-space usage can blow up very quickly (I went past 100 GB very fast), which can lead to out-of-disk-space errors, for example on AWS instances.

Instead, why not overwrite the checkpoint every time? If the goal is persistence, then overwriting the previous checkpoint would be reasonable. If the checkpoints are there to keep track of the best model (which I don't believe they are), then a more efficient strategy would be to always keep two checkpoints: the best checkpoint and the last checkpoint (assuming "best" is defined).

This is actually what I do myself. I keep track of the best model directly in the Trainable, and always checkpoint it alongside the last model. So in theory, I could consistently overwrite the checkpoint and make disk-space usage a function of the number of trials only, rather than also a function of the number of steps.
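For concreteness, here is a rough sketch of the pattern (assuming the Trainable subclass API and torch-style serialization; build_model and train_one_epoch are placeholders, not Tune APIs):

import os

import torch
from ray import tune


class MyTrainable(tune.Trainable):
    def _setup(self, config):
        self.model = build_model(config)  # placeholder: user-defined model factory
        self.best_loss = float("inf")

    def _train(self):
        loss = train_one_epoch(self.model)  # placeholder: user-defined training step
        if loss < self.best_loss:
            self.best_loss = loss
            # Overwrite a single "best" file instead of accumulating checkpoints.
            torch.save(self.model.state_dict(),
                       os.path.join(self.logdir, "best.pt"))
        return {"loss": loss}

    def _save(self, checkpoint_dir):
        # Deliberately ignore checkpoint_dir and overwrite the same "last" file,
        # so per-trial disk usage stays constant instead of growing with steps.
        path = os.path.join(self.logdir, "last.pt")
        torch.save(self.model.state_dict(), path)
        return path

    def _restore(self, checkpoint_path):
        self.model.load_state_dict(torch.load(checkpoint_path))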

Please let me know if I'm missing something! And thank you in advance for the info.

@richardliaw richardliaw self-assigned this Mar 6, 2019
@hartikainen
Contributor

I've had similar issues, especially when running off-policy algorithms like SAC from vision. Because restoring a trial requires saving the full replay pool (or at least most of it), the checkpoints grow very large.

I've alleviated the issue a little by checkpointing only the part of the replay pool that hasn't previously been saved, but this doesn't solve the whole problem.

Related to this: it would be nice to be able to specify filters for the --upload-dir rsync. For example, when I'm running real-world robotics experiments from vision, I want to save everything from the run locally so that I can restore the trial, but I only want to upload the actual results (e.g. params.json and progress.csv) to the cloud.

@richardliaw
Contributor

richardliaw commented Mar 7, 2019

Thanks for the discussion!

I agree that trial disk-space usage can be a huge issue, but it's not totally clear what API to expose here.

One option is to replace checkpoint_freq and checkpoint_at_end with one config:

checkpoint_args=dict(
    save_best=True, freq=5, monitor_attr="loss", save_most_recent=True, max_to_keep=N)

Another option is to expose hooks/callbacks (https://keras.io/callbacks/#modelcheckpoint) instead of providing a million flags. This would be something like a CheckpointPolicy callback that decides whether or not to checkpoint the trial. However, this might be overkill; let me know what you all think.
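Very roughly, such a policy object could look like the following (purely a sketch of a hypothetical interface, not an existing Tune API; only training_iteration is a real result key):

class CheckpointPolicy:
    """Hypothetical hook: decide whether a trial should checkpoint after a result."""

    def should_checkpoint(self, trial, result):
        raise NotImplementedError


class BestOrPeriodic(CheckpointPolicy):
    """Checkpoint whenever the monitored metric improves, or every `freq` iterations."""

    def __init__(self, monitor_attr="loss", freq=5):
        self.monitor_attr = monitor_attr
        self.freq = freq
        self.best = float("inf")

    def should_checkpoint(self, trial, result):
        improved = result[self.monitor_attr] < self.best
        if improved:
            self.best = result[self.monitor_attr]
        return improved or result["training_iteration"] % self.freq == 0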

Maybe we can open a new issue for the rsync customization.

@richardliaw richardliaw changed the title [Tune] Checkpoint [Tune] Revisiting checkpointing policy Mar 7, 2019
@jeremyasapp
Author

jeremyasapp commented Mar 7, 2019

Thanks for the response! Callbacks do sound like the most general approach, but I too feel like it might be overkill. I like the idea of checkpoint_args, but I wonder if we could achieve most of the desired flexibility with fewer flags.

How about letting the user decide when they want to checkpoint? Say at the end of a step, alongside the "done" flag, one could return:

report = dict(done=False, checkpoint=True, ...)

Then the user can customize their own checkpoint policy directly in the Trainable object. It's easy for the user to keep track of the best and last model, so they could do anything, really. I suppose that's quite similar to the checkpoint policy solution you're suggesting, actually. I need to think about this further... :)

On that note, I also wonder if the report dictionary should be decoupled from logging (sort of goes back to our conversation in #4157).

@richardliaw
Contributor

richardliaw commented Mar 9, 2019

It's easy for the user to keep track of the best and last model, so they could do anything, really. I suppose that's quite similar to the checkpoint policy solution you're suggesting, actually. I need to think about this further... :)

Hm, users can already do that, right? In a trainable, you can call self.save() and manipulate the logdir as you wish.
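For example, something along these lines (a rough sketch only; it assumes checkpoint directories land under the trial logdir with a checkpoint_* prefix, and run_training_step / get_state / set_state are placeholders for user code):

import glob
import os
import pickle
import shutil

from ray import tune


class MyTrainable(tune.Trainable):
    def _train(self):
        result = run_training_step(self)  # placeholder: returns a result dict
        # Checkpoint manually, then prune everything but the newest checkpoint dir.
        self.save()
        checkpoints = sorted(
            glob.glob(os.path.join(self.logdir, "checkpoint_*")),
            key=os.path.getmtime)
        for old in checkpoints[:-1]:
            shutil.rmtree(old)
        return result

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "state.pkl")
        with open(path, "wb") as f:
            pickle.dump(get_state(self), f)  # placeholder: user-defined state
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            set_state(self, pickle.load(f))  # placeholder: user-defined restore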

Let me know when you come up with something :)

On that note, I also wonder if the report dictionary should be decoupled from logging (sort of goes back to our conversation in #4157).

There's an API we're exploring that deals with this use case. The idea would be to have a separate logging mechanism that can be used freely, separate from the Tune runtime. We'll open an issue on this soon (and will notify you if interested).

@jeremyasapp
Author

Hm, users can already do that, right? In a trainable, you can call self.save() and manipulate the logdir as you wish.

You're absolutely right actually. I'll manipulate save to do what I want. Feel free to close the issue!

@richardliaw
Contributor

BTW @jeremyasapp, #5064 actually ends up implementing what you originally suggested with Trainables controlling the checkpoint policy directly. This is nice because it allows Tune to manage the checkpoint and recover the trial effectively. In addition, #4490 automatically deletes checkpoints, though it may be a bit inefficient :)

Please check those out!
