I'm using the NeMo 24.07 container to train LLMs.

I want to save checkpoints in `.nemo` format, so I set `exp_manager.checkpoint_callback_params.always_save_nemo: true`. It works in the sense that the checkpoint is saved as a `.nemo` file, but every file gets the same name regardless of the step. As a result, each save overwrites the same `.nemo` file, and at the end of training I am left with only the last checkpoint. So it is useless for keeping intermediate checkpoints.

I have tried setting `postfix: {step}.nemo`, as is done with the megatron checkpoint, but it does not work. It does not seem like `.format()` is ever called on it, so I don't know how to make the name depend on the step number.

Please fix this issue somehow.
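Until the file name can be templated, one possible workaround is to copy the `.nemo` file to a step-stamped name after each save, so the next save cannot overwrite it. The sketch below is a minimal illustration using only the Python standard library. The helper `snapshot_nemo` and its naming scheme are hypothetical and not part of NeMo, and it assumes you know the path of the saved `.nemo` file and the current step.

```python
import shutil
from pathlib import Path


def snapshot_nemo(nemo_path, step, out_dir=None):
    """Copy a .nemo file to a step-stamped name.

    Hypothetical helper (not a NeMo API): it only copies the file that
    the trainer has already written, so later saves that reuse the
    original name no longer destroy earlier checkpoints.
    """
    src = Path(nemo_path)
    dst_dir = Path(out_dir) if out_dir is not None else src.parent
    dst_dir.mkdir(parents=True, exist_ok=True)
    # Example: model.nemo at step 1000 becomes model-step=1000.nemo
    dst = dst_dir / f"{src.stem}-step={step}{src.suffix}"
    shutil.copy2(src, dst)
    return dst
```

It would have to be called from wherever the training loop knows a save has just finished. Which hook that is depends on the setup and is left as an assumption here. Note that this doubles disk usage for the most recent checkpoint.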