unscale_() has already been called on this optimizer since the last update(). #24849

Closed
paxvinci opened this issue Jul 17, 2023 · 6 comments

paxvinci commented Jul 17, 2023

Hi all,
I'm facing the error in the subject line. I saw this problem has already been solved, but I'm still hitting it. This is how I configured the parameters for the trainer.

trainer = transformers.Trainer(
    model=model, # model is decapoda-research/llama-7b-hf
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE, # 4 micro batch size
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS, # 16
        auto_find_batch_size=False,  # set True to avoid unscale() problem
        warmup_steps=100,
        num_train_epochs=EPOCHS, #2 epochs
        learning_rate=LEARNING_RATE, # 3e-4
        fp16=True,
        logging_steps=20,
        optim="adamw_torch",
        output_dir=NAME,
        save_total_limit=3,
        save_strategy="steps",
        save_steps=200,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

The strange behaviour is that the error is raised right after the end of the first epoch.

{'loss': 0.8378, 'learning_rate': 0.00016153846153846153, 'epoch': 0.99}
 50%|███████████████████████████████████████████▌                                           | 831/1660 [15:57<6:52:51, 29.88s/it]
Traceback (most recent call last):
  File "/home/paco/dev/stambecco/train.py", line 138, in <module>
    trainer.train(resume_from_checkpoint=checkpoint_flag)
  File "/home/paco/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/paco/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1850, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/home/paco/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1893, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/paco/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1856, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/paco/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 275, in unscale_
    raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")
RuntimeError: unscale_() has already been called on this optimizer since the last update().
 50%|█████     | 831/1660 [16:27<16:24,  1.19s/it]
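
If I understand the error correctly, it comes from PyTorch's AMP grad scaler, which allows only one unscale_() call per optimizer between two update() calls. A minimal sketch of that invariant, unrelated to my actual training code (toy model and optimizer for illustration only; assumes a CUDA device, since GradScaler disables itself on CPU):

import torch

model = torch.nn.Linear(4, 4).cuda()
opt = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(2, 4, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(opt)  # fine: first unscale_ since the last update()
scaler.unscale_(opt)  # RuntimeError: unscale_() has already been called on this optimizer since the last update().

So it looks like the Trainer/Accelerate combination ends up calling unscale_ twice for the same optimizer step.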

System Info

The environment is WSL
Linux 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

pip list

Package                  Version
------------------------ -------------
accelerate               0.20.3
aiohttp                  3.8.4
aiosignal                1.3.1
async-timeout            4.0.2
attrs                    23.1.0
bitsandbytes             0.39.1
blinker                  1.4
certifi                  2022.12.7
charset-normalizer       2.1.1
cmake                    3.25.0
command-not-found        0.3
cryptography             3.4.8
datasets                 2.13.0
dbus-python              1.2.18
dill                     0.3.6
distro                   1.7.0
distro-info              1.1build1
filelock                 3.9.0
frozenlist               1.3.3
fsspec                   2023.6.0
httplib2                 0.20.2
huggingface-hub          0.15.1
idna                     3.4
importlib-metadata       4.6.4
jeepney                  0.7.1
Jinja2                   3.1.2
keyring                  23.5.0
launchpadlib             1.10.16
lazr.restfulclient       0.14.4
lazr.uri                 1.0.6
lit                      15.0.7
loralib                  0.1.1
MarkupSafe               2.1.2
more-itertools           8.10.0
mpmath                   1.2.1
multidict                6.0.4
multiprocess             0.70.14
netifaces                0.11.0
networkx                 3.0
numpy                    1.24.1
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
oauthlib                 3.2.0
packaging                23.1
pandas                   2.0.2
peft                     0.4.0.dev0
Pillow                   9.3.0
pip                      22.0.2
psutil                   5.9.5
pyarrow                  12.0.1
PyGObject                3.42.1
PyJWT                    2.3.0
pyparsing                2.4.7
python-apt               2.4.0+ubuntu1
python-dateutil          2.8.2
pytz                     2023.3
PyYAML                   5.4.1
regex                    2023.6.3
requests                 2.28.1
safetensors              0.3.1
scipy                    1.10.1
SecretStorage            3.3.1
sentencepiece            0.1.99
setuptools               59.6.0
six                      1.16.0
ssh-import-id            5.11
sympy                    1.11.1
systemd-python           234
tokenizers               0.13.3
torch                    2.0.1+cu117
torchaudio               2.0.2+cu117
torchvision              0.15.2+cu117
tqdm                     4.65.0
transformers             4.31.0.dev0
triton                   2.0.0
typing_extensions        4.4.0
tzdata                   2023.3
ubuntu-advantage-tools   8001
ufw                      0.36.1
unattended-upgrades      0.1
urllib3                  1.26.13
wadllib                  1.3.6
wheel                    0.37.1
xxhash                   3.2.0
yarl                     1.9.2
zipp                     1.0.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import os

import transformers
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    BASE_MODEL, add_eos_token=True
)

# model is decapoda-research/llama-7b-hf, loaded in 8-bit (loading code omitted)
model = prepare_model_for_int8_training(model)

print("Preparing LoRA weights")
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
tokenizer.pad_token_id = 0  # We want this to be different from the eos token

if DATA_PATH.endswith(".json") or DATA_PATH.endswith(".jsonl"):
    data = load_dataset("json", data_files=DATA_PATH)
else:
    data = load_dataset(DATA_PATH)

# Functions tokenize() and generate_prompt() (roughly sketched after this listing) read the json file with the following format:
# {
#    "instruction": "",
#    "input": "",
#    "output": ""
# },
data = data.shuffle().map(lambda x: tokenize(generate_prompt(x)))

model.print_trainable_parameters()
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        auto_find_batch_size=False,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=20,
        optim="adamw_torch",
        output_dir=NAME,
        save_total_limit=3,
        save_strategy="steps",
        save_steps=200,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False
checkpoint_folder = os.path.join(os.getcwd(), NAME)
# check if the checkpoint folder exists and is not empty
checkpoint_flag = os.path.isdir(checkpoint_folder) and len(os.listdir(checkpoint_folder)) > 0
print(f"Does a checkpoint folder exist? {checkpoint_flag}\n")
trainer.train(resume_from_checkpoint=checkpoint_flag)

model.save_pretrained(f"models/{NAME}")
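
The tokenize() and generate_prompt() helpers are not shown above; roughly, they build an Alpaca-style prompt from each {"instruction", "input", "output"} record and tokenize it for causal LM training. A hypothetical sketch of what they do (the prompt template and CUTOFF_LEN are placeholders, not my exact code):

CUTOFF_LEN = 256  # assumed value

def generate_prompt(example):
    # Build one training prompt from an {"instruction", "input", "output"} record.
    if example["input"]:
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

def tokenize(prompt):
    # Tokenize the full prompt; labels are derived from input_ids by the
    # DataCollatorForLanguageModeling(mlm=False) collator at training time.
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    )
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }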

Expected behavior

The error should not be raised and training should continue with epoch 2.

sgugger (Collaborator) commented Jul 17, 2023

cc @muellerzr and @pacman100

pacman100 (Contributor)

Hello @paxvinci, I am running the following example and am unable to reproduce the issue:

Command:

cd transformers

python examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm \
    --gradient_accumulation_steps 6 \
    --overwrite_output_dir

output logs

[INFO|trainer.py:1686] 2023-07-17 15:47:49,578 >> ***** Running training *****
[INFO|trainer.py:1687] 2023-07-17 15:47:49,578 >>   Num examples = 2,318
[INFO|trainer.py:1688] 2023-07-17 15:47:49,578 >>   Num Epochs = 3
[INFO|trainer.py:1689] 2023-07-17 15:47:49,578 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1692] 2023-07-17 15:47:49,578 >>   Total train batch size (w. parallel, distributed & accumulation) = 48
[INFO|trainer.py:1693] 2023-07-17 15:47:49,578 >>   Gradient Accumulation steps = 6
[INFO|trainer.py:1694] 2023-07-17 15:47:49,578 >>   Total optimization steps = 144
[INFO|trainer.py:1695] 2023-07-17 15:47:49,578 >>   Number of trainable parameters = 124,439,808
[INFO|integrations.py:716] 2023-07-17 15:47:49,579 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: smangrul. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/sourab/transformers/examples/pytorch/language-modeling/wandb/run-20230717_154750-20eekm9c
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run usual-dragon-27
wandb: ⭐️ View project at https://wandb.ai/smangrul/huggingface
wandb: 🚀 View run at https://wandb.ai/smangrul/huggingface/runs/20eekm9c
100%|█████████████████████████████████████████████████████████████| 144/144 [09:01<00:00,  3.76s/it][INFO|trainer.py:1934] 2023-07-17 15:56:56,376 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 546.7981, 'train_samples_per_second': 12.718, 'train_steps_per_second': 0.263, 'train_loss': 3.233305189344618, 'epoch': 2.98}
100%|█████████████████████████████████████████████████████████████| 144/144 [09:01<00:00,  3.76s/it]
[INFO|trainer.py:2807] 2023-07-17 15:56:56,378 >> Saving model checkpoint to /tmp/test-clm
[INFO|configuration_utils.py:458] 2023-07-17 15:56:56,378 >> Configuration saved in /tmp/test-clm/config.json
[INFO|configuration_utils.py:375] 2023-07-17 15:56:56,379 >> Configuration saved in /tmp/test-clm/generation_config.json
[INFO|modeling_utils.py:1851] 2023-07-17 15:56:57,203 >> Model weights saved in /tmp/test-clm/pytorch_model.bin
[INFO|tokenization_utils_base.py:2214] 2023-07-17 15:56:57,203 >> tokenizer config file saved in /tmp/test-clm/tokenizer_config.json
[INFO|tokenization_utils_base.py:2221] 2023-07-17 15:56:57,204 >> Special tokens file saved in /tmp/test-clm/special_tokens_map.json
***** train metrics *****
  epoch                    =       2.98
  train_loss               =     3.2333
  train_runtime            = 0:09:06.79
  train_samples            =       2318
  train_samples_per_second =     12.718
  train_steps_per_second   =      0.263
07/17/2023 15:56:57 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3081] 2023-07-17 15:56:57,284 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-17 15:56:57,284 >>   Num examples = 240
[INFO|trainer.py:3086] 2023-07-17 15:56:57,284 >>   Batch size = 8
100%|███████████████████████████████████████████████████████████████| 30/30 [00:07<00:00,  4.20it/s]
***** eval metrics *****
  epoch                   =       2.98
  eval_accuracy           =     0.4212
  eval_loss               =     3.0811
  eval_runtime            = 0:00:07.36
  eval_samples            =        240
  eval_samples_per_second =     32.588
  eval_steps_per_second   =      4.074
  perplexity              =    21.7826
wandb: Waiting for W&B process to finish... (success).
wandb: \ 0.015 MB of 0.015 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb:                  eval/accuracy ▁
wandb:                      eval/loss ▁
wandb:                   eval/runtime ▁
wandb:        eval/samples_per_second ▁
wandb:          eval/steps_per_second ▁
wandb:                    train/epoch ▁▁
wandb:              train/global_step ▁▁
wandb:               train/total_flos ▁
wandb:               train/train_loss ▁
wandb:            train/train_runtime ▁
wandb: train/train_samples_per_second ▁
wandb:   train/train_steps_per_second ▁
wandb: 
wandb: Run summary:
wandb:                  eval/accuracy 0.42115
wandb:                      eval/loss 3.08111
wandb:                   eval/runtime 7.3646
wandb:        eval/samples_per_second 32.588
wandb:          eval/steps_per_second 4.074
wandb:                    train/epoch 2.98
wandb:              train/global_step 144
wandb:               train/total_flos 3610010714112000.0
wandb:               train/train_loss 3.23331
wandb:            train/train_runtime 546.7981
wandb: train/train_samples_per_second 12.718
wandb:   train/train_steps_per_second 0.263
wandb: 
wandb: 🚀 View run usual-dragon-27 at: https://wandb.ai/smangrul/huggingface/runs/20eekm9c
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)

Using the latest transformers and accelerate main branches.
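(Both installed from source, e.g. `pip install git+https://github.com/huggingface/transformers git+https://github.com/huggingface/accelerate`.)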

pacman100 (Contributor)

Please share a minimal reproducer so that we can dig deeper if the issue still persists.

paxvinci (Author)

I cannot share the JSON file because it contains confidential data. I reinstalled the latest transformers and restarted the training session. If I hit the error again, I'll post an update.

paxvinci (Author)

Update: I installed the latest version of transformers via pip and started training again. After a couple of problems due to a BSOD, I tried to resume training from the checkpoints, but I still get "Can't find a valid checkpoint at". There is a warning after the model is created:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
LLAMA Tokenizer created LlamaTokenizer(name_or_path='decapoda-research/llama-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)

I tried to change from LlamaTokenizer to LLaMATokenizer, but that class does not exist.
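
As a workaround I am considering resolving the checkpoint path explicitly instead of passing a boolean, so the Trainer either gets a concrete checkpoint directory or None. A sketch of what I mean, using get_last_checkpoint from transformers.trainer_utils (untested, just the idea):

from transformers.trainer_utils import get_last_checkpoint

# Returns the newest checkpoint-* subfolder of the output dir, or None if there is none.
last_checkpoint = None
if os.path.isdir(checkpoint_folder):
    last_checkpoint = get_last_checkpoint(checkpoint_folder)

trainer.train(resume_from_checkpoint=last_checkpoint)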

github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
