Gemma 2 forward pass broken #5802

Closed
1 task done
Sehyo opened this issue Oct 23, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments


Sehyo commented Oct 23, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

I tried KTO fine-tuning with Gemma 2 base models, and it fails with a "srcIndex < srcSelectDimSize" assertion error.
Other models work fine for finetuning.
llamafactory version is 0.9.1.dev0
Transformers version is 4.45.2

Please see:

../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1044,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
  File "/home/ubuntu/factory/LLaMA-Factory/venv/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/factory/LLaMA-Factory/src/llamafactory/cli.py", line 111, in main
    run_exp()
  File "/home/ubuntu/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 58, in run_exp
    run_kto(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/home/ubuntu/factory/LLaMA-Factory/src/llamafactory/train/kto/workflow.py", line 78, in run_kto
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3485, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/trl/trainer/kto_trainer.py", line 1382, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs)
  File "/home/ubuntu/factory/LLaMA-Factory/src/llamafactory/train/kto/trainer.py", line 196, in get_batch_loss_metrics
    self.concatenated_forward(model, batch)
  File "/home/ubuntu/factory/LLaMA-Factory/src/llamafactory/train/kto/trainer.py", line 152, in concatenated_forward
    target_logps, target_logps_avg = self.forward(model, batch)
  File "/home/ubuntu/factory/LLaMA-Factory/src/llamafactory/train/kto/trainer.py", line 144, in forward
    logits = model(**model_inputs, return_dict=True, use_cache=False).logits.to(torch.float32)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/peft/peft_model.py", line 1577, in forward
    return self.base_model(
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 1047, in forward
    outputs = self.model(
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/factory/LLaMA-Factory/venv/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 850, in forward
    cache_position = torch.arange(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [4,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [5,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [6,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [7,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [8,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [9,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [10,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [11,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [12,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [13,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [14,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [15,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [16,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [17,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [18,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [19,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [20,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [21,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [22,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [23,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [24,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [25,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [26,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [27,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [28,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [1045,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
0%| | 0/6033 [00:00<?, ?it/s]
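
The error message above recommends CUDA_LAUNCH_BLOCKING=1, because asynchronous kernel launches can make the reported frame (here torch.arange) differ from the call that actually failed. A minimal sketch of enabling it from Python, assuming the code runs before torch initializes CUDA; exporting the variable in the shell before invoking llamafactory-cli should work as well. The helper name is hypothetical and only illustrates the check.

```python
import os

# Set this before the first CUDA call (safest: before importing torch),
# otherwise it may have no effect on the current process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


print(cuda_launch_blocking_enabled())
```

With synchronous launches the traceback should point at the operation that raised the assertion rather than at a later, unrelated call.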

Reproduction

Run a KTO fine-tuning task with a Gemma 2 base model.

Expected behavior

Training should run without errors.

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Oct 23, 2024

Sehyo commented Oct 23, 2024

Update: the issue seems to happen when I use the "chatml" template. With the "gemma" template, it works.
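
An indexSelectLargeIndex assertion like the one above generally means that some index passed to a lookup, most often a token id passed to the embedding layer, is not smaller than the number of rows in the table. One plausible explanation, not confirmed in this thread, is that the "chatml" template relies on special tokens whose ids fall outside the Gemma 2 embedding matrix. A minimal sketch of a sanity check follows; `find_out_of_range_ids` is a hypothetical helper, and the table size and ids are made-up values for illustration.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    input_ids: Iterable[int], num_embeddings: int
) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that a table with
    `num_embeddings` rows cannot look up."""
    return [
        (pos, tok)
        for pos, tok in enumerate(input_ids)
        if tok < 0 or tok >= num_embeddings
    ]


# Illustrative values only: a table with 10 rows and a batch with two invalid ids.
num_embeddings = 10
batch = [1, 4, 9, 10, 3, 12]
bad = find_out_of_range_ids(batch, num_embeddings)
print(bad)  # [(3, 10), (5, 12)]
```

In a real run, `num_embeddings` could be read from `model.get_input_embeddings().num_embeddings` on the base model and compared with `len(tokenizer)`; running the check on CPU over one tokenized batch gives a readable result instead of a device-side assert.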

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Oct 24, 2024

hiyouga commented Oct 24, 2024

We should use the correct chat template for Gemma models.

@hiyouga hiyouga closed this as completed Oct 24, 2024