
llama_self_extend_patch_4_36 does not work #23

Closed
YL-9 opened this issue Mar 8, 2024 · 7 comments

YL-9 commented Mar 8, 2024

When I use 4.36.2 it doesn't work, but with 4.32.0 it works.
The only change I made was replacing "import llama_self_extend_patch as LlamaSE" with "import llama_self_extend_patch_4_36 as LlamaSE" in "llama_example.py".

Mooler0410 (Collaborator) commented Mar 9, 2024

We found that starting with 4.36, the default attention class of Llama changed from "LlamaAttention" to "LlamaSdpaAttention", so a replacement that targets "LlamaAttention" no longer takes effect. Instead, you may try changing

modify_method_of_instance(base_model, "LlamaAttention", "forward", self_extend_forward)

to

modify_method_of_instance(base_model, "LlamaSdpaAttention", "forward", self_extend_forward)

This is likely the reason for the failure.
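For reference, a minimal sketch of how the patched call might look in llama_example.py under transformers>=4.36. The import path of modify_method_of_instance and the group-size argument names are assumptions based on the example script, and the group-size values are purely illustrative:

```python
# Sketch only: import paths and parameter names below are assumptions, not verbatim repo code.
from functools import partial

from transformers import AutoModelForCausalLM

import llama_self_extend_patch_4_36 as LlamaSE       # 4.36-compatible patch
from modify_utils import modify_method_of_instance   # assumed helper location

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Illustrative group sizes; choose values as described in the repository's README.
self_extend_forward = partial(LlamaSE.self_extend_forward,
                              group_size_1=4, group_size_2=1024)

# transformers>=4.36 instantiates LlamaSdpaAttention by default, so patch that class name.
modify_method_of_instance(base_model, "LlamaSdpaAttention", "forward", self_extend_forward)
```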

YL-9 (Author) commented Mar 9, 2024

> We found that starting with 4.36, the default attention class of Llama changed from "LlamaAttention" to "LlamaSdpaAttention" [...] This is likely the reason for the failure.

That works, thank you.
I have another question: I want to add it to eval/passkey.py as well, but it still only runs correctly on 4.32; the results under 4.36 are still wrong.
I just added the three pieces of code shown in the screenshots below and ran them with this command: CUDA_VISIBLE_DEVICES=0,1 python eval/passkey.py --model /data/supry/models/llama-2/llama2-7b-hf --min-tokens 4096 --max-tokens 8192 --tokens-step 4096 --length-step 1024 --iterations 20 --serope
[Screenshots of the added code: https://github.com/datamllab/LongLM/assets/73892208/ac8c2bc8-b5f1-4215-b77c-dd37da0523e2 and https://github.com/datamllab/LongLM/assets/73892208/ce7832ec-db10-4f78-9dd3-ba04549495d6; a third image link is broken.]

YL-9 (Author) commented Mar 9, 2024

[Two screenshots attached; the image links are broken.]

Mooler0410 (Collaborator) commented

Hi YL-9! Could you please test whether SelfExtend works with instance-wise modification, as in the example we provide? Sometimes a direct modification to the transformers class does not take effect, and the cause of failure varies case by case. That's why we chose to modify the forward function of a model instance rather than its class. (This also avoids unexpected behavior, since the modification only applies to that specific model instance.)
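For illustration, a minimal sketch (not the repository's actual helper; the function name here is hypothetical) of what instance-wise replacement means: the new forward is bound to each matching sub-module of one model object, while the transformers class itself is left untouched.

```python
# Hypothetical helper, shown only to illustrate instance-wise patching.
import types

import torch.nn as nn


def patch_forward_of_instance(model: nn.Module, target_cls_name: str, new_forward) -> int:
    """Bind new_forward to every sub-module whose class name matches target_cls_name."""
    patched = 0
    for module in model.modules():
        if module.__class__.__name__ == target_cls_name:
            # MethodType binds the function to this specific instance only,
            # so other model instances and the class definition are unaffected.
            module.forward = types.MethodType(new_forward, module)
            patched += 1
    return patched
```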

Mooler0410 pinned this issue Mar 17, 2024
YL-9 (Author) commented Mar 18, 2024

ok, thank you!

ys-zong commented Apr 1, 2024

Hi, thanks for the nice work! I see that the current implementation in llama_self_extend_patch_4_36.py uses regular PyTorch attention. Do you plan to implement Flash Attention for transformers==4.36?

Mooler0410 (Collaborator) commented

> Hi, thanks for the nice work! I see that the current implementation in llama_self_extend_patch_4_36.py uses regular PyTorch attention. Do you plan to implement Flash Attention for transformers==4.36?

Hi, thank you for your interest. The main difference between transformers==4.36 and transformers==4.38.2 is how RoPE is applied to the KV; you may check that. The self-attention computation is nearly the same, so you can follow our 4.38.2 implementation to build a Flash Attention implementation for 4.36 with minor modifications.

One possible issue is the flash_attn version used by 4.36. In that case, you may use our Triton flash attention implementation instead of flash_attn; it's 10~20% slower than flash_attn.
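As an illustration of the version caveat, a minimal sketch (an assumption, not repository code; the minimum version is illustrative) of checking whether a usable flash_attn is installed before choosing the flash-attention code path, falling back to a Triton implementation otherwise:

```python
# Sketch only: the minimum version and the fallback choice are illustrative assumptions.
import importlib.metadata


def flash_attn_at_least(min_version: str = "2.0.0") -> bool:
    """Return True if flash_attn is installed and its version is >= min_version."""
    try:
        installed = importlib.metadata.version("flash_attn")
    except importlib.metadata.PackageNotFoundError:
        return False

    def to_tuple(v: str):
        parts = []
        for piece in v.split(".")[:3]:
            digits = "".join(ch for ch in piece if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return tuple(parts)

    return to_tuple(installed) >= to_tuple(min_version)


# Example: pick the flash_attn-based path only when a compatible version is present;
# otherwise fall back to a Triton-based attention implementation.
use_flash = flash_attn_at_least("2.0.0")
```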
