
Data Loading bug in pretrain on resume over multiple epochs #1712

Open
fdalvi opened this issue Sep 7, 2024 · 0 comments
Labels
bug Something isn't working


fdalvi commented Sep 7, 2024

Bug description

The pretrain command currently saves and loads (on resume) the train_dataloader as part of the model state. The issue is that CycleIterator is initialized after reloading the train_dataloader:

train_iterator = CycleIterator(train_dataloader)

and it takes the restored state as the "starting point" for all future epochs. In practice this means that if an experiment crashed 90% of the way through the dataset, the first epoch after resuming will run as expected, but every subsequent epoch will only go through the last 10% of the data repeatedly. The other, smaller inconvenience is that CycleIterator is also how the current code tracks epochs; since its state is not saved, the epoch count will be incorrect in resumed runs whenever the crash happened anywhere other than epoch 1.
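
For context, here is a simplified sketch of the behavior involved (not the exact litgpt implementation, just the relevant mechanics):

```python
class CycleIterator:
    """Simplified sketch of a cycling iterator as used by pretrain.

    The real litgpt class may differ in details; this only illustrates
    why resuming with a restored train_dataloader breaks epoch cycling.
    """

    def __init__(self, iterable):
        self.iterable = iterable  # here: the resumed train_dataloader
        self.epoch = 0
        self._iterator = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # A new iterator is created from the same wrapped object. If that
            # object is a dataloader restored at 90% progress, each new
            # "epoch" only yields the remaining 10% of the data (the reported
            # bug), and the epoch count starts from scratch after a resume.
            self._iterator = iter(self.iterable)
            self.epoch += 1
            return next(self._iterator)
```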

As far as I can see, there are four potential solutions, but I'm not 100% sure which of them are feasible:

  1. Do the same thing we do in finetune_full: don't save the train_dataloader state and instead skip all the data that training has already seen in the previous run (easily done by looking at initial_iter and checking whether it is 0; see the sketch after this list). This is, however, quite inefficient, especially at pretraining data scales.
  2. Save the CycleIterator state instead of the train_dataloader state. I am not sure what Lightning Fabric's save/load expects, so I am not sure if this is even possible.
  3. Supply CycleIterator with both the saved train_dataloader state and a "fresh" train_dataloader, so it can continue from the first but always use the second for resetting. This feels a bit hacky and clunky, however.
  4. Somehow "reset" the train_dataloader to the beginning inside CycleIterator when reseting the iterator. I am not sure how this can be done (without changing the semantics of shuffling etc).

I'm happy to implement any of these if they seem like the right solution!

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.10
fdalvi added the bug Something isn't working label on Sep 7, 2024