"Dynamic" Issue in LlamaDynamicNTKScalingRotaryEmbedding - Long context inference will impact short context inference. #25306
Comments
Hey! Thanks for reporting, this is a duplicate of #25104. Will link it in the PR as well.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
No, they're not the same. I understand #25104 is about the trade-off between using the KV cache and the resulting rotary embedding inconsistency. But when you freeze everything during generation, including random seeds, the same input should give the same output sequence. The dynamic NTK rotary embedding only recalculates when the input sequence is longer than the cached one. What if the longest sequence comes first? The cached embeddings will never change again. PR #25308 is a correct fix without extra computation. I think it should be merged.
I see. Makes sense to me. @gante, can you have a look? 🤗
@i4never I agree, it is a limitation of the technique when implemented as the authors suggest. #25308 is not the correct fix either -- we should only resize the [...] Would you like to open a PR to fix it? :)
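The exact resizing rule suggested above is cut off, so the sketch below is only a hypothetical illustration, not the maintainers' proposal or the transformers implementation: it keeps the usual NTK rescaling for long inputs, but also rebuilds the original, unscaled cos/sin tables whenever an input fits within the original context window, so short prompts are no longer affected by an earlier long one. The class name and its API are made up for illustration.

```python
import torch

# Hypothetical sketch (not the actual transformers code): a rotary embedding
# whose cos/sin cache can shrink back to the unscaled table once inputs fit
# within the original context window again.
class DynamicNTKRotaryWithReset(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.scaling_factor = scaling_factor
        self._set_cos_sin_cache(max_position_embeddings)

    def _set_cos_sin_cache(self, seq_len):
        self.max_seq_len_cached = seq_len
        base = self.base
        if seq_len > self.max_position_embeddings:
            # NTK-aware rescaling of the rotary base, applied only to long sequences
            base = self.base * (
                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
            ) ** (self.dim / (self.dim - 2))
        inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float() / self.dim))
        t = torch.arange(seq_len, dtype=torch.float32)
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.cos_cached = emb.cos()
        self.sin_cached = emb.sin()

    def forward(self, seq_len):
        # Grow the cache for long inputs, but also rebuild the unscaled table
        # when a short input arrives after a long one, so identical short
        # prompts always see identical embeddings.
        if seq_len > self.max_seq_len_cached or (
            seq_len <= self.max_position_embeddings
            and self.max_seq_len_cached > self.max_position_embeddings
        ):
            self._set_cos_sin_cache(max(seq_len, self.max_position_embeddings))
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```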
System Info
transformers version: 4.32.0.dev0
Who can help?
@sgugger
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Please see my colab code:
https://colab.research.google.com/drive/1SnQQxW7WMHgSOvAwF_HIlIDrAuXZ4IKp?usp=sharing
I asked the same prompt twice, with a long-context prompt inserted in between. However, this intermediate long-context inference resulted in different answers for the same question before and after it.
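For readers who do not want to open the notebook, a minimal sketch of this kind of reproduction is below. The checkpoint name, prompts, prompt lengths, and generation settings are placeholders rather than the exact values from the Colab.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # selects LlamaDynamicNTKScalingRotaryEmbedding
    torch_dtype=torch.float16,
    device_map="auto",
)

def generate(prompt):
    torch.manual_seed(0)  # freeze randomness; greedy decoding below is deterministic anyway
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)

short_prompt = "What is the capital of France?"
# Make the middle prompt longer than max_position_embeddings so the NTK rescale triggers.
long_prompt = "word " * 6000

answer_before = generate(short_prompt)
_ = generate(long_prompt)              # this call grows cos_cached / sin_cached
answer_after = generate(short_prompt)

print(answer_before == answer_after)   # reported behavior: False, despite identical inputs
```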
Expected behavior
Since the tested prompts are well within the maximum input length the model can handle, the point of "Dynamic" scaling is that the embeddings for those prompts should stay the same, and consequently the outputs before and after the long-context request should also be the same.
I reviewed the code of the class "LlamaDynamicNTKScalingRotaryEmbedding" and I think that, due to caching, when the model infers a long context, the cached values of cos_cached and sin_cached are updated to adapt to the longer context. This causes the issue when the model infers a shorter context again.
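For reference, the relevant logic looks roughly like the sketch below (paraphrased and simplified from modeling_llama.py around v4.31/v4.32, not copied verbatim). The cache is only ever resized upwards, so once a long prompt has rescaled the rotary base, every later short prompt slices into the rescaled tables.

```python
# Simplified paraphrase of LlamaDynamicNTKScalingRotaryEmbedding's caching logic.
def _set_cos_sin_cache(self, seq_len, device, dtype):
    self.max_seq_len_cached = seq_len
    if seq_len > self.max_position_embeddings:
        # Rescale the rotary base for sequences past the original context window.
        base = self.base * (
            (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
        ) ** (self.dim / (self.dim - 2))
        inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
    t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
    freqs = torch.einsum("i,j->ij", t, self.inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
    self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

def forward(self, x, seq_len=None):
    # The cache only grows: a long prompt permanently rescales cos_cached /
    # sin_cached, and later short prompts keep slicing the rescaled tables.
    if seq_len > self.max_seq_len_cached:
        self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
    return self.cos_cached[:seq_len].to(dtype=x.dtype), self.sin_cached[:seq_len].to(dtype=x.dtype)
```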