ZeroDivisionError when training on a single batch of data #27758

Closed
tleyden opened this issue Nov 29, 2023 · 8 comments · Fixed by #28756
Labels
Good Second Issue: Issues that are more difficult to do than "Good First" issues - give it a try if you want!

Comments

@tleyden
Contributor

tleyden commented Nov 29, 2023

System Info


  • transformers version: 4.35.2
  • Platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: not that I'm aware of; there's only a single GPU and a single machine

Who can help?

@muellerzr and @pacman100 (tagging for trainer)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Overview

With very small datasets that end up being a single batch, this code in transformers/trainer.py#L1968-L1969:

# add remaining tr_loss
self._total_loss_scalar += tr_loss.item()
train_loss = self._total_loss_scalar / self.state.global_step

will throw a ZeroDivisionError, because self.state.global_step is still 0 when the division runs.

Since I can't always control the data that gets uploaded to the service I'm working on, this is problematic: users receive a cryptic error that makes it look like the service itself is broken.

Reproduction

Train on a dataset with a single batch of data.
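
Here's a rough, self-contained sketch of the kind of setup that hits this path. The model name, texts, and sizes are illustrative placeholders rather than my actual job (which goes through TRL's SFTTrainer, as the stack trace below shows), and whether it crashes depends on the library versions; my environment above uses transformers 4.35.2.

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# A dataset small enough that the whole epoch fits inside a single
# gradient-accumulation window, similar in spirit to the run logged below.
texts = ["tiny example"] * 8
dataset = Dataset.from_dict({"text": texts, "label": [0] * len(texts)})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=16),
    batched=True,
)

args = TrainingArguments(
    output_dir="repro-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=1,
)

# On affected versions the loop can finish with global_step == 0 and die
# with the ZeroDivisionError shown in the stack trace below.
Trainer(model=model, args=args, train_dataset=dataset).train()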

Logger info:

Currently training with a batch size of: 1
***** Running training *****
  Num examples = 37
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 32
  Total optimization steps = 3
  Number of trainable parameters = 109,051,904

Error stack trace:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/dalm/training/generator_only/trainer.py", line 301, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/dalm/training/generator_only/trainer.py", line 264, in main
    train_generator(
  File "/opt/conda/lib/python3.10/site-packages/dalm/training/generator_only/trainer.py", line 255, in train_generator
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 280, in train
    output = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1969, in _inner_training_loop
    train_loss = self._total_loss_scalar / self.state.global_step
ZeroDivisionError: float division by zero

Expected behavior

Instead of throwing a cryptic ZeroDivisionError, it should at least raise a more user-friendly error, such as "Training with a single batch of data is not supported. Try again with a larger dataset."

It would be even better if the trainer handled this case gracefully and approximated the loss, for example by adding a small constant to the denominator or by making global_step 1-based instead of 0-based.
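
For illustration only, here is a sketch of what that friendlier error could look like, placed right before the division (not a proposal for the exact wording or the final fix):

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        if self.state.global_step == 0:
            raise ValueError(
                "No optimizer steps were completed. The dataset is too small for the "
                "configured batch size / gradient accumulation settings; try a larger "
                "dataset or a smaller effective batch size."
            )
        train_loss = self._total_loss_scalar / self.state.global_step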

The following workaround in transformers/trainer.py avoids the error.

Change:

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        train_loss = self._total_loss_scalar / self.state.global_step

to:

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        gstep = self.state.global_step
        if gstep <= 0:
            gstep = 1
        train_loss = self._total_loss_scalar / gstep

This avoids the error, though it reports an incorrect loss value.

@pacman100
Contributor

Thank you @tleyden for raising this issue. As you already have the fix, it would be great if you could open a PR.

@tleyden
Contributor Author

tleyden commented Nov 29, 2023

Hey @pacman100, happy to open a PR.

That was just a "workaround experiment", though. I think the right way to fix it is probably one of the following approaches:

  1. Adding a small constant to the denominator (rough sketch after this list)
  2. Making global_step 1-based instead of 0-based
  3. Probably others; I'll do some research
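
A rough sketch of what option 1 could look like (illustrative only; the 1e-3 constant is an arbitrary choice):

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        # a small constant keeps the division defined when no optimizer step
        # ever completed; note it also perturbs the reported loss slightly on
        # normal runs, which is why the clamp-style workaround above may be
        # preferable
        train_loss = self._total_loss_scalar / (self.state.global_step + 1e-3)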

tleyden added a commit to tleyden/transformers that referenced this issue Nov 29, 2023
@github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jan 7, 2024
@ArthurZucker ArthurZucker reopened this Jan 8, 2024
@ArthurZucker ArthurZucker added the WIP label (Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress) Jan 8, 2024
@ArthurZucker
Collaborator

@tleyden feel free to link your PR! 🤗

@tleyden
Contributor Author

tleyden commented Jan 8, 2024

Hey @ArthurZucker, thanks for re-opening; I do think this is worth fixing. Here's the hack I did to fix it:

tleyden@550ba49

but I think that's more of a workaround than an actual root-cause fix. If someone more familiar with the codebase could give some guidance on the best approach, I can put together a PR.

@ArthurZucker ArthurZucker added the Good Second Issue label (Issues that are more difficult to do than "Good First" issues - give it a try if you want!) and removed the WIP label Jan 8, 2024
tleyden added a commit to tleyden/transformers that referenced this issue Jan 29, 2024
@tleyden
Contributor Author

tleyden commented Jan 29, 2024

@ArthurZucker @pacman100 PTAL at #28756

@ParulGupta16

Thank you for the fix, @tleyden. I found another workaround for my case, where I'm learning to use SFTTrainer with PEFT: increasing the number of epochs to a value greater than 1 (in my scenario, num_train_epochs=2 in TrainingArguments) also resolves the problem.
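
Concretely, the change is just the epoch count in TrainingArguments (the other values below are placeholders, not my exact configuration):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=2,  # > 1 avoids the ZeroDivisionError in this setup
)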

@tleyden
Contributor Author

tleyden commented Feb 3, 2024

@ParulGupta16 Thanks for posting, that is a good hint about how to fix the underlying bug. My PR in #28756 is more of a workaround.
