Data Loading bug in pretrain on resume over multiple epochs #1712
Labels: bug (Something isn't working)
Bug description
The `pretrain` command currently saves and loads (on resume) the `train_dataloader` as part of model state. The issue is that `CycleIterator` is initialized after reloading the `train_dataloader` (litgpt/litgpt/pretrain.py, line 282 in 1d37f9a), so it takes the restored state as the "starting point" for all future epochs. In practice this means that if an experiment crashed 90% of the way through the dataset, on resume the first epoch will run as expected, but subsequent epochs will only go through the last 10% of the data, repeatedly. A smaller inconvenience is that `CycleIterator` is how the current code tracks epochs, and since its state is not saved, if the crash happens anywhere other than epoch 1, the epoch count will be incorrect in resumed runs.
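To make the failure mode concrete, here is a toy reproduction. Both classes below are assumptions for illustration only: `StatefulLoader` is a stand-in for a resumable dataloader, and this `CycleIterator` is a simplified sketch, not the actual litgpt code.

```python
class StatefulLoader:
    """Stand-in for a resumable dataloader whose restored checkpoint
    state makes iteration start at `start` instead of 0."""
    def __init__(self, data, start=0):
        self.data = data
        self.start = start  # restored from checkpoint on resume

    def __iter__(self):
        # A loader restored mid-epoch resumes from the saved offset.
        return iter(self.data[self.start:])


class CycleIterator:
    """Simplified sketch: wraps an iterable and restarts it on
    exhaustion, counting epochs."""
    def __init__(self, iterable):
        self.iterable = iterable
        self.epoch = 0
        self._iterator = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # The bug: this restart also begins at the *restored*
            # offset, so later epochs only replay the dataset's tail.
            self._iterator = iter(self.iterable)
            self.epoch += 1
            return next(self._iterator)


# Crash happened 90% through a 10-item dataset; resume restores start=9.
loader = StatefulLoader(list(range(10)), start=9)
it = CycleIterator(loader)
seen = [next(it) for _ in range(4)]
print(seen)  # [9, 9, 9, 9] — only the last item repeats across "epochs"
```

The first `next` finishes the interrupted epoch correctly; every reset after that replays only the restored tail, which is the behavior described above.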
As far as I can see, there are four potential solutions, but I'm not 100% sure which are possible:

1. Don't save the `train_dataloader` state and just skip all the data that the training has seen in the previous run (easily done by looking at `initial_iter` and checking whether it is 0). This is however quite inefficient, especially at pretraining data scales.
2. Save and restore the iterator state rather than the `train_dataloader`. I am not sure what Lightning Fabric save/load expect, so I am not sure if this is even possible.
3. Keep both the restored `train_dataloader` state and a "fresh" `train_dataloader`, so training can continue from the first but always use the second for resetting. This feels a bit hacky and clunky, however.
4. Reset the `train_dataloader` to the beginning inside `CycleIterator` when resetting the iterator. I am not sure how this can be done (without changing the semantics of shuffling, etc.).

I'm happy to implement any of these if they seem like the right solution!
What operating system are you using?
Linux
LitGPT Version