LlamaRotaryEmbedding (wrong cache value when casting model to float16/bfloat16) #25681
Comments
cc @gante you recently worked on the extension of the cache for RotaryEmbeddings! Might affect other (dynamic) ones as well.
Hey @KeremTurgutlu 👋 It is known that, when casting to 16 bits for inference purposes, you should use the exact same casting strategy as was used with the model at train time. We try to store that in the `torch_dtype` config attribute. In this particular case, the issue is compounded by the fact that the RoPE layer has buffers, which mask the issue in some cases. @ArthurZucker should we emit a warning when the model gets converted to a 16-bit format different from the stored `torch_dtype`?
This is the same bug that's discussed here. The fix is to calculate the sin and cos values in init and ensure they're not stored in buffers. Or don't cast the model, but instead use autocast, which avoids this issue. Note that with deepspeed it will always cast, so you need the fix.
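A minimal sketch of that first approach (this is not the exact code from the linked PR; the class name and shape handling are illustrative):

```python
import torch
import torch.nn as nn


class RotaryEmbeddingEagerCache(nn.Module):
    """Compute cos/sin once at init, in float32, as plain attributes (not buffers)."""

    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        t = torch.arange(max_position_embeddings, dtype=torch.float32)
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # plain tensor attributes are ignored by model.half()/model.bfloat16(),
        # so the cached values always stay in float32
        self.cos_cached = emb.cos()
        self.sin_cached = emb.sin()

    def forward(self, x, seq_len):
        # cast to the activation dtype/device only at the point of use
        cos = self.cos_cached[:seq_len].to(dtype=x.dtype, device=x.device)
        sin = self.sin_cached[:seq_len].to(dtype=x.dtype, device=x.device)
        return cos, sin
```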
There's also #24262, and having a code fix would be better than adding another warning.
@KeremTurgutlu `inv_freq` is always float32 since it's converted using `.float()` when the layer is initialized.
No, it's stored as a buffer, so it gets cast in some situations. See the full description of the bug and code to fix it here: EleutherAI/gpt-neox#1003
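A quick way to see this (a minimal check, independent of the Llama code):

```python
import torch
import torch.nn as nn

m = nn.Module()
m.register_buffer("inv_freq", torch.ones(4, dtype=torch.float32), persistent=False)

# .to(dtype) casts floating-point buffers (persistent or not) along with parameters
print(m.to(torch.bfloat16).inv_freq.dtype)  # torch.bfloat16
```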
@ArthurZucker @gante I don't think this issue should be closed AFAICT.
Yep, it’s on my todo list for when I deep dive into all the Llama-related issues.
Sorry, I'll get to this soon 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
- transformers version: 4.31.0
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_forward_prefetch': False, 'fsdp_min_num_params': 100000000, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 2, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR'}
Who can help?
@ArthurZucker would be the best person to discuss this.
Information
Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
TL;DR: If a model with a `LlamaRotaryEmbedding` layer is cast to bfloat16/float16 after initialization, and during the forward pass a sequence with a sequence length > `self.max_position_embeddings` is used, then the cached cos and sin buffer values will most probably differ from those of the trained model, giving unexpected results.
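A minimal sketch of the failure mode (written against the `LlamaRotaryEmbedding` in transformers 4.31; the head dimension, number of heads, and sequence lengths are illustrative):

```python
import torch
from transformers.models.llama.modeling_llama import LlamaRotaryEmbedding

max_pos = 2048
rope_fp32 = LlamaRotaryEmbedding(dim=128, max_position_embeddings=max_pos)
rope_bf16 = LlamaRotaryEmbedding(dim=128, max_position_embeddings=max_pos).to(torch.bfloat16)

# a sequence longer than max_position_embeddings forces the cos/sin cache to be rebuilt
x = torch.randn(1, 32, 4096, 128)  # [batch, heads, seq_len, head_dim]
cos_ref, _ = rope_fp32(x, seq_len=4096)                      # cache rebuilt in float32
cos_cast, _ = rope_bf16(x.to(torch.bfloat16), seq_len=4096)  # cache rebuilt from bf16 inv_freq

# the rebuilt caches no longer match the float32 values the model was trained with
print((cos_ref - cos_cast.float()).abs().max())
```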
I came across this very subtle error doing the following, and I am not sure what the best solution might be. I finetuned the Llama-2 model using accelerate FSDP with a bfloat16 mixed-precision policy. I used a slightly different config than the original one, in which `max_position_embeddings=2048` was set. FSDP + accelerate use autocast under the hood, which conveniently keeps the ops inside `LlamaRotaryEmbedding` in full precision.

The problem happens when we feed a sequence with a greater sequence length and also cast the model to a lower precision, as opposed to using autocast. For inference, I loaded this trained model cast to lower precision (rather than wrapping it in autocast), along the lines of the following sketch:
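(The exact loading snippet isn't reproduced in the issue; this is a sketch of the problematic setup, with a placeholder checkpoint path.)

```python
import torch
from transformers import AutoModelForCausalLM

# casting the whole model, rather than running inference under autocast, also casts
# the RoPE inv_freq / cos_cached / sin_cached buffers to bfloat16
model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-llama-2")  # placeholder path
model = model.to(torch.bfloat16).eval()
```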
My custom config looked like this, notice the `"max_position_embeddings": 2048` entry:
During inference, when testing the trained model, my training/validation perplexity increased from ~2.5 to ~20.0. It took me two days to figure out that the exact issue was model casting combined with sequence lengths > `max_position_embeddings`.
Potential Fixes:

1. Initialize with a large enough `self.max_position_embeddings` value so that the cos-sin caches won't be re-initialized with wrong values due to lower precision. Even using `self.max_position_embeddings=80k` should be fine given the relatively small size of the buffer compared to the total model size.
2. Modify `LlamaRotaryEmbedding` so that float32 is always used in the ops and the result is cast to `x.dtype` only at the very end. This is a bit difficult because if a model is cast to bfloat16/float16, it will still produce different cache values even if it's cast back to float32 (the `inv_freq` buffer has already lost precision). I don't know if there is a way to disable model casting for certain layers, but I guess that would be autocast 😄. A modified version along these lines will produce closer but still wrong cache values; see the sketch after this list.
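(A hedged reconstruction of that modified version, since the original snippet isn't reproduced here; caching and the exact HF shape conventions are omitted for brevity, and the class name is illustrative.)

```python
import torch
import torch.nn as nn


class LlamaRotaryEmbeddingFloat32Ops(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.max_position_embeddings = max_position_embeddings

    def forward(self, x, seq_len=None):
        # do the cache math in float32 even if the buffer was downcast by model.to(...);
        # a downcast inv_freq has already lost precision, so the result is closer to,
        # but still not identical to, the original float32 values
        t = torch.arange(seq_len, device=x.device, dtype=torch.float32)
        freqs = torch.outer(t, self.inv_freq.to(device=x.device, dtype=torch.float32))
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)
```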
I personally will keep `self.max_position_embeddings` as high as my maximum intended sequence length, and will also use autocast where possible.
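(A minimal sketch of the autocast approach; `model` and `input_ids` are placeholders for the finetuned model and a tokenized batch.)

```python
import torch

model = model.float().eval()  # keep the weights and the RoPE buffers in float32
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(input_ids).logits  # matmul-heavy ops run in bf16, buffers stay fp32
```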
cc: @ArthurZucker
Expected behavior
Same cache values for rotary embeddings.