ZeroDivisionError when training on a single batch of data #27758

Closed
tleyden opened this issue Nov 29, 2023 · 8 comments · Fixed by #28756
Labels
Good Second Issue: Issues that are more difficult to do than "Good First" issues - give it a try if you want!

Comments

@tleyden
Contributor

tleyden commented Nov 29, 2023

System Info


  • transformers version: 4.35.2
  • Platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: not that I'm aware of; there's only a single GPU and a single machine

Who can help?

@muellerzr and @pacman100 (tagging for trainer)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Overview

With very small datasets that end up being a single batch, this code in transformers/trainer.py#L1968-L1969:

# add remaining tr_loss
self._total_loss_scalar += tr_loss.item()
train_loss = self._total_loss_scalar / self.state.global_step

will throw a ZeroDivisionError, because self.state.global_step is still 0 when the division runs.

Since I can't always control the data that gets uploaded to the service I'm working on, this is problematic: users receive a cryptic error that makes it look like the service itself is broken.

Reproduction

Train on a dataset with a single batch of data.
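
Here's a rough, self-contained sketch of the kind of setup that hits this path. The model name, texts, and sizes are illustrative placeholders rather than my actual job (which goes through TRL's SFTTrainer, as the stack trace below shows), and whether it crashes depends on the library versions; my environment above uses transformers 4.35.2.

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# A dataset small enough that the whole epoch fits inside a single
# gradient-accumulation window, similar in spirit to the run logged below.
texts = ["tiny example"] * 8
dataset = Dataset.from_dict({"text": texts, "label": [0] * len(texts)})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=16),
    batched=True,
)

args = TrainingArguments(
    output_dir="repro-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=1,
)

# On affected versions the loop can finish with global_step == 0 and die
# with the ZeroDivisionError shown in the stack trace below.
Trainer(model=model, args=args, train_dataset=dataset).train()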

Logger info:

Currently training with a batch size of: 1
***** Running training *****
  Num examples = 37
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 32
  Total optimization steps = 3
  Number of trainable parameters = 109,051,904

Error stack trace:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/dalm/training/generator_only/trainer.py", line 301, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/dalm/training/generator_only/trainer.py", line 264, in main
    train_generator(
  File "/opt/conda/lib/python3.10/site-packages/dalm/training/generator_only/trainer.py", line 255, in train_generator
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 280, in train
    output = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1969, in _inner_training_loop
    train_loss = self._total_loss_scalar / self.state.global_step
ZeroDivisionError: float division by zero

Expected behavior

Instead of throwing a cryptic ZeroDivisionError, it should at least raise a more user-friendly error, such as "Training with a single batch of data is not supported. Try again with a larger dataset."

It would be even better if the trainer handled this case gracefully and approximated the loss, for example by adding a small constant to the denominator or by making global_step 1-based instead of 0-based.
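
For illustration only, here is a sketch of what that friendlier error could look like, placed right before the division (not a proposal for the exact wording or the final fix):

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        if self.state.global_step == 0:
            raise ValueError(
                "No optimizer steps were completed. The dataset is too small for the "
                "configured batch size / gradient accumulation settings; try a larger "
                "dataset or a smaller effective batch size."
            )
        train_loss = self._total_loss_scalar / self.state.global_step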

The following workaround in transformers/trainer.py avoids the error.

Change:

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        train_loss = self._total_loss_scalar / self.state.global_step

to:

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        gstep = self.state.global_step
        if gstep <= 0:
            gstep = 1
        train_loss = self._total_loss_scalar / gstep

This avoids the error, though it reports an incorrect loss value.

@pacman100
Contributor

Thank you @tleyden for raising this issue. As you already have the fix, it would be great if you could open a PR.

@tleyden
Contributor Author

tleyden commented Nov 29, 2023

Hey @pacman100, happy to open a PR.

That was just a "workaround experiment", though. I think the right way to fix it is probably one of the following approaches:

  1. Adding a small constant to the denominator (rough sketch after this list)
  2. Making global_step 1-based instead of 0-based
  3. Probably others; I'll do some research
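
A rough sketch of what option 1 could look like (illustrative only; the 1e-3 constant is an arbitrary choice):

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        # a small constant keeps the division defined when no optimizer step
        # ever completed; note it also perturbs the reported loss slightly on
        # normal runs, which is why the clamp-style workaround above may be
        # preferable
        train_loss = self._total_loss_scalar / (self.state.global_step + 1e-3)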

tleyden added a commit to tleyden/transformers that referenced this issue Nov 29, 2023
@github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jan 7, 2024
@ArthurZucker ArthurZucker reopened this Jan 8, 2024
@ArthurZucker ArthurZucker added the WIP label (Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress) Jan 8, 2024
@ArthurZucker
Collaborator

@tleyden feel free to link your PR! 🤗

@tleyden
Contributor Author

tleyden commented Jan 8, 2024

Hey @ArthurZucker, thanks for re-opening; I do think this is worth fixing. Here's the hack I did to fix it:

tleyden@550ba49

but I think that's more of a workaround than an actual root-cause fix. If someone more familiar with the codebase could give some guidance on the best approach, I can put together a PR.

@ArthurZucker ArthurZucker added the Good Second Issue label (Issues that are more difficult to do than "Good First" issues - give it a try if you want!) and removed the WIP label Jan 8, 2024
tleyden added a commit to tleyden/transformers that referenced this issue Jan 29, 2024
@tleyden
Contributor Author

tleyden commented Jan 29, 2024

@ArthurZucker @pacman100 PTAL at #28756

@ParulGupta16

Thank you for the fix, @tleyden. I found another workaround for my case, where I'm learning to use SFTTrainer with PEFT: increasing the number of epochs to a value greater than 1 (in my scenario, num_train_epochs=2 in TrainingArguments) also resolves the problem.
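
Concretely, the change is just the epoch count in TrainingArguments (the other values below are placeholders, not my exact configuration):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=2,  # > 1 avoids the ZeroDivisionError in this setup
)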

@tleyden
Contributor Author

tleyden commented Feb 3, 2024

@ParulGupta16 Thanks for posting, that is a good hint about how to fix the underlying bug. My PR in #28756 is more of a workaround.
