Data Loading bug in pretrain on resume over multiple epochs #1712
Labels: bug (Something isn't working)
Bug description
The `pretrain` command currently saves and loads (on resume) the `train_dataloader` as part of model state. The issue is that `CycleIterator` is initialized after reloading the `train_dataloader` (litgpt/litgpt/pretrain.py, line 282 in 1d37f9a), so it takes the restored state as the "starting point" for all future epochs. In practice this means that if an experiment crashed 90% of the way through the dataset, on resume the first epoch will run as expected, but subsequent epochs will only go through the last 10% of the data, repeatedly. A smaller inconvenience is that `CycleIterator` is how the current code tracks epochs, and since its state is not saved, if the crash happens anywhere other than epoch 1, the epoch count will be incorrect in resumed runs.
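To make the failure mode concrete, here is a toy reproduction. Both classes below are assumptions for illustration only: `StatefulLoader` is a stand-in for a resumable dataloader, and this `CycleIterator` is a simplified sketch, not the actual litgpt code.

```python
class StatefulLoader:
    """Stand-in for a resumable dataloader whose restored checkpoint
    state makes iteration start at `start` instead of 0."""
    def __init__(self, data, start=0):
        self.data = data
        self.start = start  # restored from checkpoint on resume

    def __iter__(self):
        # A loader restored mid-epoch resumes from the saved offset.
        return iter(self.data[self.start:])


class CycleIterator:
    """Simplified sketch: wraps an iterable and restarts it on
    exhaustion, counting epochs."""
    def __init__(self, iterable):
        self.iterable = iterable
        self.epoch = 0
        self._iterator = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # The bug: this restart also begins at the *restored*
            # offset, so later epochs only replay the dataset's tail.
            self._iterator = iter(self.iterable)
            self.epoch += 1
            return next(self._iterator)


# Crash happened 90% through a 10-item dataset; resume restores start=9.
loader = StatefulLoader(list(range(10)), start=9)
it = CycleIterator(loader)
seen = [next(it) for _ in range(4)]
print(seen)  # [9, 9, 9, 9] — only the last item repeats across "epochs"
```

The first `next` finishes the interrupted epoch correctly; every reset after that replays only the restored tail, which is the behavior described above.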
As far as I can see, there are four potential solutions, but I'm not 100% sure which are possible:

1. Don't save the `train_dataloader` state and just skip all the data that the training has seen in the previous run (easily done by looking at `initial_iter` and checking whether it is 0). This is however quite inefficient, especially at pretraining data scales.
2. Save and restore the iterator state rather than the `train_dataloader`. I am not sure what Lightning Fabric save/load expect, so I am not sure if this is even possible.
3. Keep both the restored `train_dataloader` state and a "fresh" `train_dataloader`, so training can continue from the first but always use the second for resetting. This feels a bit hacky and clunky, however.
4. Reset the `train_dataloader` to the beginning inside `CycleIterator` when resetting the iterator. I am not sure how this can be done (without changing the semantics of shuffling, etc.).

I'm happy to implement any of these if they seem like the right solution!
What operating system are you using?
Linux
LitGPT Version