[Tune] Revisiting checkpointing policy #4287
Comments
I've had similar issues, especially when running off-policy algorithms like SAC from vision. Because restoring a trial requires saving the full replay pool (or at least most of it), the checkpoint sizes grow very large. I've alleviated the issue a little by checkpointing only the part of the replay pool that has not previously been saved, but this doesn't solve the whole issue. Related to this: it would be nice to be able to specify filters for the ...
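For illustration, a rough sketch of the incremental replay-pool checkpointing described above; the class, the list-like buffer layout, and the file naming are assumptions made for the sketch, not code from SAC or any RL library:

```python
# Rough sketch of incremental replay-pool checkpointing: each checkpoint
# writes only the transitions added since the previous one. The buffer
# layout and file naming are assumptions, not code from any RL library.
import os
import pickle


class IncrementalReplayCheckpointer:
    def __init__(self):
        self.num_saved = 0  # how many transitions have already been persisted

    def save(self, replay_pool, checkpoint_dir):
        # Persist only the slice of the pool added since the last save.
        new_transitions = replay_pool[self.num_saved:]
        path = os.path.join(checkpoint_dir, f"replay_chunk_{self.num_saved}.pkl")
        with open(path, "wb") as f:
            pickle.dump(new_transitions, f)
        self.num_saved = len(replay_pool)

    def restore(self, checkpoint_dirs):
        # Rebuild the pool by concatenating the chunks written by earlier
        # checkpoints, iterating over the checkpoint dirs in order.
        pool = []
        for d in checkpoint_dirs:
            for name in os.listdir(d):
                if name.startswith("replay_chunk_"):
                    with open(os.path.join(d, name), "rb") as f:
                        pool.extend(pickle.load(f))
        return pool
```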
Thanks for the discussion! I agree that trial disk-space usage can be a huge issue, but it's not totally clear what API to expose here. One option is to replace the current checkpointing flags with something like checkpoint_args=dict(save_best=True, freq=5, monitor_attr="loss", save_most_recent=True, max_to_keep=N). Another option is to expose hooks/callbacks (https://keras.io/callbacks/#modelcheckpoint) instead of providing a million flags. This would be something like a CheckpointPolicy callback that decides whether or not to checkpoint the trial. However, this might be overkill; let me know what you all think. Maybe we can open a new issue for the ...
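For illustration, a rough sketch of what the callback-style option could look like; CheckpointPolicy and every name in it are hypothetical, not an existing Tune API:

```python
# Hypothetical sketch of the callback-style option discussed above. None of
# these names exist in Tune; they only illustrate the idea of a pluggable
# policy object that decides when a trial should be checkpointed.

class CheckpointPolicy:
    """Decides, per training result, whether a trial should be checkpointed."""

    def should_checkpoint(self, trial_id, result):
        raise NotImplementedError


class BestAndPeriodic(CheckpointPolicy):
    """Checkpoint every `freq` iterations and whenever `monitor_attr` improves."""

    def __init__(self, monitor_attr="loss", freq=5):
        self.monitor_attr = monitor_attr
        self.freq = freq
        self.best = float("inf")

    def should_checkpoint(self, trial_id, result):
        improved = result[self.monitor_attr] < self.best
        if improved:
            self.best = result[self.monitor_attr]
        periodic = result["training_iteration"] % self.freq == 0
        return improved or periodic
```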
Thanks for the response! Callbacks do sound like the most general approach, but I too feel like it might be overkill. I like the idea of the checkpoint_args, but I wonder if we could achieve most of the desired flexibility with fewer flags. How about letting the user decide when they want to checkpoint? Say at the end of a step, alongside the "done" flag, one could return something like the result sketched below:
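(A minimal illustration of what such a result might contain; the should_checkpoint key and the metric name are invented for the sketch, not part of Tune's result schema in this discussion.)

```python
# Hypothetical result returned at the end of a training step; the
# "should_checkpoint" key is an invented name used only to illustrate a
# per-step checkpoint request from the Trainable.
result = {
    "done": False,
    "mean_loss": 0.42,
    "should_checkpoint": True,  # ask Tune to checkpoint this trial now
}
```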
Then the user can customize their own checkpoint policy directly in the Trainable object. It's easy for the user to keep track of the best and last model, so they would be able to do anything really. I suppose that's quite similar to the checkpoint policy solution you're suggesting, actually. I need to think about this further... :) On that note, I also wonder if the report dictionary should be decoupled from logging (sort of goes back to our conversation in #4157).
Hm, users can already do that, right? In a Trainable, you can call ... Let me know when you come up with something :)
There's an API we're exploring that deals with this use case. The idea would be to have a separate logging mechanism that can be used freely, separate from the Tune runtime. We'll open an issue on this soon (and will notify you if you're interested).
You're absolutely right, actually. I'll manipulate ...
BTW @jeremyasapp, #5064 actually ends up implementing what you originally suggested, with Trainables controlling the checkpoint policy directly. This is nice because it allows Tune to manage the checkpoint and recover the trial effectively. In addition, #4490 automatically deletes checkpoints, though it may be a bit inefficient :) Please check those out!
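For reference, a hedged sketch of how the retention side can be configured via tune.run; checkpoint_freq, checkpoint_at_end, keep_checkpoints_num, and checkpoint_score_attr are arguments from Tune's run API of roughly that era, but the exact mapping to #4490/#5064 is my assumption, and the stub Trainable below is only there to make the call concrete:

```python
# Hedged sketch: retention-related arguments to tune.run from roughly this
# era of Tune. The stub Trainable is a placeholder so the call is runnable;
# whether these knobs correspond exactly to #4490/#5064 is an assumption.
import os

from ray import tune
from ray.tune import Trainable


class MyTrainable(Trainable):
    """Minimal stub Trainable so the tune.run call below is concrete."""

    def _setup(self, config):
        self._step = 0

    def _train(self):
        self._step += 1
        return {"loss": 1.0 / self._step}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint.txt")
        with open(path, "w") as f:
            f.write(str(self._step))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self._step = int(f.read())


tune.run(
    MyTrainable,
    stop={"training_iteration": 20},
    checkpoint_freq=5,                 # checkpoint every 5 training iterations
    checkpoint_at_end=True,            # also checkpoint when the trial finishes
    keep_checkpoints_num=2,            # keep only the 2 best-scoring checkpoints per trial
    checkpoint_score_attr="min-loss",  # "min-" prefix: lower loss scores better
)
```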
Describe the problem
I was wondering if it'd be possible to get more information about the design decision to save every checkpoint during training? When running large grid searches where each trial consists of many steps, disk-space usage can blow up very quickly (I went beyond 100 GB very fast), which may lead to out-of-disk-space errors, for example when using AWS instances.
Instead, why not overwrite the checkpoints every time? If the goal is persistence, then overwriting the previous checkpoint would be reasonable. If the checkpoints are there to keep track of the best model (which I don't believe they are), then another, more efficient strategy would be to always keep two checkpoints: the best checkpoint and the last checkpoint (assuming "best" is defined).
This is actually what I do myself: I keep track of the best model directly in the Trainable and always checkpoint it alongside the last model. So in theory, I could consistently overwrite the checkpoint and make disk-space usage a function of the number of trials only, as opposed to also making it a function of the number of steps.
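For illustration, a minimal sketch of this "best + last" pattern, written against the older class-based Trainable API (_setup/_train/_save/_restore); the model state, the "loss" metric, and the training helper are placeholders:

```python
# Sketch of the "best + last" pattern described above, using the older
# class-based Trainable API. The model state, the "loss" metric, and
# _run_one_epoch are placeholders for real training code.
import os
import pickle

from ray.tune import Trainable


class BestAndLastTrainable(Trainable):
    def _setup(self, config):
        self.model_state = {"weights": None}  # placeholder for real model state
        self.best_loss = float("inf")
        self.best_state = None

    def _train(self):
        loss = self._run_one_epoch()
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_state = dict(self.model_state)
        return {"loss": loss}

    def _save(self, checkpoint_dir):
        # Always write the last and the best state to a fixed file name, so a
        # trial that overwrites its checkpoint uses O(1) disk, not O(steps).
        path = os.path.join(checkpoint_dir, "checkpoint.pkl")
        with open(path, "wb") as f:
            pickle.dump(
                {"last": self.model_state,
                 "best": self.best_state,
                 "best_loss": self.best_loss},
                f,
            )
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
        self.model_state = state["last"]
        self.best_state = state["best"]
        self.best_loss = state["best_loss"]

    def _run_one_epoch(self):
        # Placeholder: real code would train the model and return its loss.
        return 1.0
```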
Please let me know if I'm missing something! And thank you in advance for the info.