Make dataloader stateful? #291
Thanks for creating this issue! In fact we recently started working on it, in #279.
Tested the branch.
It is weird, but all ranks are receiving the same state_dict. Not quite sure what happened here, but this might be the reason.
@XinDongol Thanks for trying it out! The work is still in progress and not ready.
I was wondering whether you found the root cause of all ranks receiving the same state_dict of the dataloader?
Yes. Currently, if a value is not a DTensor, only rank 0's copy is saved. After the fix, we'd like to be able to save values across all ranks, as long as the state_dict keys are different on each rank.
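A minimal sketch of that idea, assuming a recent PyTorch with the `torch.distributed.checkpoint.stateful.Stateful` protocol and a dataloader that already exposes `state_dict()` / `load_state_dict()`; the wrapper and key naming here are illustrative, not the actual implementation:

```python
import torch.distributed as dist
from torch.distributed.checkpoint.stateful import Stateful


class RankKeyedDataLoader(Stateful):
    """Namespaces the dataloader's checkpoint state by rank.

    Plain (non-DTensor) values saved under the same key would keep only
    rank 0's copy, so each rank saves its state under a rank-specific key.
    """

    def __init__(self, dataloader):
        self.dataloader = dataloader
        self.rank = dist.get_rank() if dist.is_initialized() else 0

    def state_dict(self):
        # Distinct keys per rank let the checkpointer store every rank's
        # dataloader position instead of only rank 0's.
        return {f"dataloader_rank_{self.rank}": self.dataloader.state_dict()}

    def load_state_dict(self, state_dict):
        own = state_dict.get(f"dataloader_rank_{self.rank}")
        if own is not None:
            self.dataloader.load_state_dict(own)
```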
@tianyu-l @gokulavasan
@XinDongol The reason we didn't prioritize supporting this is that, for text-only pretraining, data loading time is usually small compared to the computation of each training step. For these reasons, we think it's better not to introduce the additional complexity. However, things may change if we are going to support multi-modal training, as the loading time of images / video could be much longer. Happy to hear your thoughts on it.
I agree. For very large models, that may be the case. Torchtitan is currently doing on-the-fly tokenization. Increasing num_workers makes on-the-fly tokenization even faster, because reading raw text is more IO-efficient than reading pre-computed tokens.
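For illustration, a rough sketch of what on-the-fly tokenization with worker processes can look like (the dataset class and the `tokenize` callable are placeholders, not torchtitan's code):

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class OnTheFlyTextDataset(IterableDataset):
    """Reads raw text files and tokenizes them lazily as they are consumed."""

    def __init__(self, text_files, tokenize, seq_len):
        self.text_files = text_files
        self.tokenize = tokenize  # e.g. a callable wrapping the real tokenizer
        self.seq_len = seq_len

    def __iter__(self):
        info = get_worker_info()
        # Shard files across workers so each worker tokenizes a distinct slice.
        files = (
            self.text_files
            if info is None
            else self.text_files[info.id :: info.num_workers]
        )
        for path in files:
            with open(path) as f:
                for line in f:
                    # Tokenization happens here, inside the worker process,
                    # so only raw text is read from disk.
                    tokens = self.tokenize(line)
                    for i in range(0, len(tokens) - self.seq_len + 1, self.seq_len):
                        yield tokens[i : i + self.seq_len]


# With num_workers > 0, tokenization runs in parallel worker processes and
# overlaps with the training step on the main process, e.g.:
# loader = DataLoader(OnTheFlyTextDataset(files, tok, 2048), batch_size=8, num_workers=4)
```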
@XinDongol Appreciate your feedback a lot!
I wonder what ...
Currently, resuming from a checkpoint restarts the dataloader from the beginning. This may lead to issues for training.
We may need to resume the dataloader from its saved state so that already-sampled data is skipped.
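A minimal sketch of the resume flow, assuming torchdata's `StatefulDataLoader`; the toy dataset and the step at which we checkpoint are placeholders:

```python
from torchdata.stateful_dataloader import StatefulDataLoader

# Toy dataset standing in for the real tokenized stream.
dataset = list(range(100))


def make_loader():
    # StatefulDataLoader exposes state_dict()/load_state_dict(), so the
    # position in the data stream can be checkpointed alongside the model.
    return StatefulDataLoader(dataset, batch_size=8, num_workers=0)


loader = make_loader()
checkpoint = None
for step, batch in enumerate(loader):
    # ... run the training step on `batch` ...
    if step == 3:  # pretend we save a checkpoint here
        checkpoint = loader.state_dict()
        break

# On resume, restore the saved state so already-sampled batches are skipped
# instead of the dataloader replaying from the beginning of the dataset.
resumed_loader = make_loader()
resumed_loader.load_state_dict(checkpoint)
for batch in resumed_loader:
    pass  # continues from where the saved run left off
```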