unscale_() has already been called on this optimizer since the last update(). #24849

Closed
paxvinci opened this issue Jul 17, 2023 · 6 comments

paxvinci commented Jul 17, 2023

Hi all,
I'm facing the error in the subject line. I saw this problem has already been solved, but I'm still hitting it. This is how I configured the parameters for the trainer.

trainer = transformers.Trainer(
    model=model, # model is decapoda-research/llama-7b-hf
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE, # 4 micro batch size
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS, # 16
        auto_find_batch_size=False,  # set True to avoid unscale() problem
        warmup_steps=100,
        num_train_epochs=EPOCHS, #2 epochs
        learning_rate=LEARNING_RATE, # 3e-4
        fp16=True,
        logging_steps=20,
        optim="adamw_torch",
        output_dir=NAME,
        save_total_limit=3,
        save_strategy="steps",
        save_steps=200,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

The strange behaviour is that the error is raised right after the end of the first epoch.

{'loss': 0.8378, 'learning_rate': 0.00016153846153846153, 'epoch': 0.99}
 50%|███████████████████████████████████████████▌                                           | 831/1660 [15:57<6:52:51, 29.88s/it]
Traceback (most recent call last):
  File "/home/paco/dev/stambecco/train.py", line 138, in <module>
    trainer.train(resume_from_checkpoint=checkpoint_flag)
  File "/home/paco/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/paco/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1850, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/home/paco/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1893, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/paco/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1856, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/paco/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 275, in unscale_
    raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")
RuntimeError: unscale_() has already been called on this optimizer since the last update().
 50%|█████     | 831/1660 [16:27<16:24,  1.19s/it]
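
If I understand the error correctly, it comes from PyTorch's AMP grad scaler, which allows only one unscale_() call per optimizer between two update() calls. A minimal sketch of that invariant, unrelated to my actual training code (toy model and optimizer for illustration only; assumes a CUDA device, since GradScaler disables itself on CPU):

import torch

model = torch.nn.Linear(4, 4).cuda()
opt = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(2, 4, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(opt)  # fine: first unscale_ since the last update()
scaler.unscale_(opt)  # RuntimeError: unscale_() has already been called on this optimizer since the last update().

So it looks like the Trainer/Accelerate combination ends up calling unscale_ twice for the same optimizer step.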

System Info

The environment is WSL
Linux 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

pip list

Package                  Version
------------------------ -------------
accelerate               0.20.3
aiohttp                  3.8.4
aiosignal                1.3.1
async-timeout            4.0.2
attrs                    23.1.0
bitsandbytes             0.39.1
blinker                  1.4
certifi                  2022.12.7
charset-normalizer       2.1.1
cmake                    3.25.0
command-not-found        0.3
cryptography             3.4.8
datasets                 2.13.0
dbus-python              1.2.18
dill                     0.3.6
distro                   1.7.0
distro-info              1.1build1
filelock                 3.9.0
frozenlist               1.3.3
fsspec                   2023.6.0
httplib2                 0.20.2
huggingface-hub          0.15.1
idna                     3.4
importlib-metadata       4.6.4
jeepney                  0.7.1
Jinja2                   3.1.2
keyring                  23.5.0
launchpadlib             1.10.16
lazr.restfulclient       0.14.4
lazr.uri                 1.0.6
lit                      15.0.7
loralib                  0.1.1
MarkupSafe               2.1.2
more-itertools           8.10.0
mpmath                   1.2.1
multidict                6.0.4
multiprocess             0.70.14
netifaces                0.11.0
networkx                 3.0
numpy                    1.24.1
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
oauthlib                 3.2.0
packaging                23.1
pandas                   2.0.2
peft                     0.4.0.dev0
Pillow                   9.3.0
pip                      22.0.2
psutil                   5.9.5
pyarrow                  12.0.1
PyGObject                3.42.1
PyJWT                    2.3.0
pyparsing                2.4.7
python-apt               2.4.0+ubuntu1
python-dateutil          2.8.2
pytz                     2023.3
PyYAML                   5.4.1
regex                    2023.6.3
requests                 2.28.1
safetensors              0.3.1
scipy                    1.10.1
SecretStorage            3.3.1
sentencepiece            0.1.99
setuptools               59.6.0
six                      1.16.0
ssh-import-id            5.11
sympy                    1.11.1
systemd-python           234
tokenizers               0.13.3
torch                    2.0.1+cu117
torchaudio               2.0.2+cu117
torchvision              0.15.2+cu117
tqdm                     4.65.0
transformers             4.31.0.dev0
triton                   2.0.0
typing_extensions        4.4.0
tzdata                   2023.3
ubuntu-advantage-tools   8001
ufw                      0.36.1
unattended-upgrades      0.1
urllib3                  1.26.13
wadllib                  1.3.6
wheel                    0.37.1
xxhash                   3.2.0
yarl                     1.9.2
zipp                     1.0.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import os

import transformers
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    BASE_MODEL, add_eos_token=True
)

# model is decapoda-research/llama-7b-hf, loaded in 8-bit (loading code omitted)
model = prepare_model_for_int8_training(model)

print("Preparing LoRA weights")
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
tokenizer.pad_token_id = 0  # We want this to be different from the eos token

if DATA_PATH.endswith(".json") or DATA_PATH.endswith(".jsonl"):
    data = load_dataset("json", data_files=DATA_PATH)
else:
    data = load_dataset(DATA_PATH)

# Functions tokenize() and generate_prompt() (roughly sketched after this listing) read the json file with the following format:
# {
#    "instruction": "",
#    "input": "",
#    "output": ""
# },
data = data.shuffle().map(lambda x: tokenize(generate_prompt(x)))

model.print_trainable_parameters()
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        auto_find_batch_size=False,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=20,
        optim="adamw_torch",
        output_dir=NAME,
        save_total_limit=3,
        save_strategy="steps",
        save_steps=200,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False
checkpoint_folder = os.path.join(os.getcwd(), NAME)
# check if the checkpoint folder exists and is not empty
checkpoint_flag = os.path.isdir(checkpoint_folder) and len(os.listdir(checkpoint_folder)) > 0
print(f"Does a checkpoint folder exist? {checkpoint_flag}\n")
trainer.train(resume_from_checkpoint=checkpoint_flag)

model.save_pretrained(f"models/{NAME}")
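
The tokenize() and generate_prompt() helpers are not shown above; roughly, they build an Alpaca-style prompt from each {"instruction", "input", "output"} record and tokenize it for causal LM training. A hypothetical sketch of what they do (the prompt template and CUTOFF_LEN are placeholders, not my exact code):

CUTOFF_LEN = 256  # assumed value

def generate_prompt(example):
    # Build one training prompt from an {"instruction", "input", "output"} record.
    if example["input"]:
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

def tokenize(prompt):
    # Tokenize the full prompt; labels are derived from input_ids by the
    # DataCollatorForLanguageModeling(mlm=False) collator at training time.
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    )
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }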

Expected behavior

The error should not be raised and training should continue with epoch 2.

sgugger (Collaborator) commented Jul 17, 2023

cc @muellerzr and @pacman100

pacman100 (Contributor)

Hello @paxvinci, I am running the following example and am unable to reproduce the issue:

Command:

cd transformers

python examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm \
    --gradient_accumulation_steps 6 \
    --overwrite_output_dir

output logs

[INFO|trainer.py:1686] 2023-07-17 15:47:49,578 >> ***** Running training *****
[INFO|trainer.py:1687] 2023-07-17 15:47:49,578 >>   Num examples = 2,318
[INFO|trainer.py:1688] 2023-07-17 15:47:49,578 >>   Num Epochs = 3
[INFO|trainer.py:1689] 2023-07-17 15:47:49,578 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1692] 2023-07-17 15:47:49,578 >>   Total train batch size (w. parallel, distributed & accumulation) = 48
[INFO|trainer.py:1693] 2023-07-17 15:47:49,578 >>   Gradient Accumulation steps = 6
[INFO|trainer.py:1694] 2023-07-17 15:47:49,578 >>   Total optimization steps = 144
[INFO|trainer.py:1695] 2023-07-17 15:47:49,578 >>   Number of trainable parameters = 124,439,808
[INFO|integrations.py:716] 2023-07-17 15:47:49,579 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: smangrul. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/sourab/transformers/examples/pytorch/language-modeling/wandb/run-20230717_154750-20eekm9c
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run usual-dragon-27
wandb: ⭐️ View project at https://wandb.ai/smangrul/huggingface
wandb: 🚀 View run at https://wandb.ai/smangrul/huggingface/runs/20eekm9c
100%|█████████████████████████████████████████████████████████████| 144/144 [09:01<00:00,  3.76s/it][INFO|trainer.py:1934] 2023-07-17 15:56:56,376 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 546.7981, 'train_samples_per_second': 12.718, 'train_steps_per_second': 0.263, 'train_loss': 3.233305189344618, 'epoch': 2.98}
100%|█████████████████████████████████████████████████████████████| 144/144 [09:01<00:00,  3.76s/it]
[INFO|trainer.py:2807] 2023-07-17 15:56:56,378 >> Saving model checkpoint to /tmp/test-clm
[INFO|configuration_utils.py:458] 2023-07-17 15:56:56,378 >> Configuration saved in /tmp/test-clm/config.json
[INFO|configuration_utils.py:375] 2023-07-17 15:56:56,379 >> Configuration saved in /tmp/test-clm/generation_config.json
[INFO|modeling_utils.py:1851] 2023-07-17 15:56:57,203 >> Model weights saved in /tmp/test-clm/pytorch_model.bin
[INFO|tokenization_utils_base.py:2214] 2023-07-17 15:56:57,203 >> tokenizer config file saved in /tmp/test-clm/tokenizer_config.json
[INFO|tokenization_utils_base.py:2221] 2023-07-17 15:56:57,204 >> Special tokens file saved in /tmp/test-clm/special_tokens_map.json
***** train metrics *****
  epoch                    =       2.98
  train_loss               =     3.2333
  train_runtime            = 0:09:06.79
  train_samples            =       2318
  train_samples_per_second =     12.718
  train_steps_per_second   =      0.263
07/17/2023 15:56:57 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3081] 2023-07-17 15:56:57,284 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-17 15:56:57,284 >>   Num examples = 240
[INFO|trainer.py:3086] 2023-07-17 15:56:57,284 >>   Batch size = 8
100%|███████████████████████████████████████████████████████████████| 30/30 [00:07<00:00,  4.20it/s]
***** eval metrics *****
  epoch                   =       2.98
  eval_accuracy           =     0.4212
  eval_loss               =     3.0811
  eval_runtime            = 0:00:07.36
  eval_samples            =        240
  eval_samples_per_second =     32.588
  eval_steps_per_second   =      4.074
  perplexity              =    21.7826
wandb: Waiting for W&B process to finish... (success).
wandb: \ 0.015 MB of 0.015 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb:                  eval/accuracy ▁
wandb:                      eval/loss ▁
wandb:                   eval/runtime ▁
wandb:        eval/samples_per_second ▁
wandb:          eval/steps_per_second ▁
wandb:                    train/epoch ▁▁
wandb:              train/global_step ▁▁
wandb:               train/total_flos ▁
wandb:               train/train_loss ▁
wandb:            train/train_runtime ▁
wandb: train/train_samples_per_second ▁
wandb:   train/train_steps_per_second ▁
wandb: 
wandb: Run summary:
wandb:                  eval/accuracy 0.42115
wandb:                      eval/loss 3.08111
wandb:                   eval/runtime 7.3646
wandb:        eval/samples_per_second 32.588
wandb:          eval/steps_per_second 4.074
wandb:                    train/epoch 2.98
wandb:              train/global_step 144
wandb:               train/total_flos 3610010714112000.0
wandb:               train/train_loss 3.23331
wandb:            train/train_runtime 546.7981
wandb: train/train_samples_per_second 12.718
wandb:   train/train_steps_per_second 0.263
wandb: 
wandb: 🚀 View run usual-dragon-27 at: https://wandb.ai/smangrul/huggingface/runs/20eekm9c
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)

Using the latest transformers and accelerate main branches.
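(Both installed from source, e.g. `pip install git+https://github.com/huggingface/transformers git+https://github.com/huggingface/accelerate`.)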

pacman100 (Contributor)

Please share a minimal reproducer so that we can dig deeper if the issue still persists.

paxvinci (Author)

I cannot share the JSON file because it contains confidential data. I reinstalled the latest transformers and restarted the training session. If I hit the error again, I'll post an update.

paxvinci (Author)

Update: I installed the latest version of transformers via pip and started training again. After a couple of problems due to a BSOD, I tried to resume training from the checkpoints, but I still get "Can't find a valid checkpoint at". There is a warning after the model is created:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
LLAMA Tokenizer created LlamaTokenizer(name_or_path='decapoda-research/llama-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)

I tried to change from LlamaTokenizer to LLaMATokenizer, but that class does not exist.
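
As a workaround I am considering resolving the checkpoint path explicitly instead of passing a boolean, so the Trainer either gets a concrete checkpoint directory or None. A sketch of what I mean, using get_last_checkpoint from transformers.trainer_utils (untested, just the idea):

from transformers.trainer_utils import get_last_checkpoint

# Returns the newest checkpoint-* subfolder of the output dir, or None if there is none.
last_checkpoint = None
if os.path.isdir(checkpoint_folder):
    last_checkpoint = get_last_checkpoint(checkpoint_folder)

trainer.train(resume_from_checkpoint=last_checkpoint)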

github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
