ZeroDivisionError when training on a single batch of data #27758
Comments
Thank you @tleyden for raising this issue. As you already have the fix, it would be great if you could open a PR.
Hey @pacman100, happy to open a PR. That was just a "workaround experiment" though. I think the right way to fix it might be to look at either of the following approaches:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@tleyden feel free to link your PR! 🤗
Hey @ArthurZucker, thanks for re-opening, I do think this is worth fixing. Here's the hack I did to fix it, but I think that's more of a workaround than an actual root-cause fix. If someone more familiar with the codebase could give some guidance on the best approach, I can put together a PR.
@ArthurZucker @pacman100 PTAL at #28756
Thank you for fixing @tleyden. I found another workaround for my case, where I am learning to use SFTTrainer with PEFT.
@ParulGupta16 Thanks for posting, that is a good hint about how to fix the underlying bug. My PR in #28756 is more of a workaround.
System Info
transformers version: 4.35.2
Who can help?
@muellerzr and @pacman100 (tagging for trainer)
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Overview
With very small datasets that end up being a single batch, the code at transformers/trainer.py#L1968-L1969 (sketched below) will throw a ZeroDivisionError.
Since I can't always control the data that is being uploaded to the service I'm working on, this is problematic: users will receive a cryptic error that makes it appear that the service is broken.
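For reference, the division at those lines looks roughly like the following (a reconstruction from that area of trainer.py in v4.35.2, not a verbatim copy; the surrounding code may differ):

```python
# transformers/trainer.py, near the end of Trainer._inner_training_loop (approximate).
# If no optimizer step ever completed, self.state.global_step is still 0,
# so this division raises ZeroDivisionError.
self._total_loss_scalar += tr_loss.item()
train_loss = self._total_loss_scalar / self.state.global_step
```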
Reproduction
Train on a dataset with a single batch of data.
Logger info:
Error stack trace:
Expected behavior
Instead of throwing a cryptic ZeroDivisionError, it should at least return a more user-friendly error like "Training with a single batch of data is not supported. Try again with a larger dataset."
But it would be much better if it just handled this more gracefully and approximated the loss, maybe by adding a small constant to the denominator or making global_step 1-based instead of 0-based.
The following workaround in transformers/trainer.py avoids the error.
Update the division shown in the Overview above so that the denominator can never be zero:
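A sketch of the kind of change described (not the exact patch; the constant here is arbitrary and only illustrates the small-constant idea):

```python
# Hypothetical workaround inside Trainer._inner_training_loop: clamp the
# denominator so the division cannot raise. When global_step is 0 the reported
# loss is inflated and not meaningful, which is why this is only a workaround.
effective_global_step = max(self.state.global_step, 0.001)  # avoid division by zero
train_loss = self._total_loss_scalar / effective_global_step
```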
This avoids the error, though it gives the incorrect loss value.
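Alternatively, the "more user-friendly error" suggestion above could be as simple as a guard before the division (a sketch only, not code from any existing PR):

```python
# Hypothetical guard: fail with an actionable message instead of a bare
# ZeroDivisionError when training ends without a single completed optimizer step.
if self.state.global_step == 0:
    raise ValueError(
        "Training finished without completing any optimizer step. "
        "Training with a single batch of data is not supported; "
        "try again with a larger dataset."
    )
train_loss = self._total_loss_scalar / self.state.global_step
```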