
Flash Attention regurgitates repeated tokens - seq2seq #1752

Open
ArtanisTheOne opened this issue Aug 3, 2024 · 1 comment

@ArtanisTheOne

I'm having generation issues with NMT models trained with OpenNMT-py, including models trained with versions that predate flash attention and one I'm currently training with the most recent version, which includes flash attention. The models were converted using onmt_release_model with storage quantization set to int8.
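
For reference, a minimal sketch of the conversion step described above. The original report used onmt_release_model; the standalone CTranslate2 converter CLI shown here performs the equivalent conversion, and the paths and output directory are placeholders:

```bash
# Hypothetical paths; int8 storage quantization as in the report above.
ct2-opennmt-py-converter --model_path model.pt --output_dir ct2_model --quantization int8
```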

It happens when setting flash_attention=True when creating the ctranslate2.Translator object. The GPU is an RTX 3090.

I don't know if this is just an arch issue or something to do with the conversion process from OpenNMT-py.
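
For context, a minimal sketch of how flash attention is enabled at load time, assuming a converted model directory ct2_model and illustrative pre-tokenized input (both are placeholders, not taken from the report verbatim):

```python
import ctranslate2

# flash_attention=True is the option referenced above; the device and
# compute_type values are assumptions matching the int8 conversion and the RTX 3090.
translator = ctranslate2.Translator(
    "ct2_model",
    device="cuda",
    compute_type="int8",
    flash_attention=True,
)

# translate_batch expects pre-tokenized input (lists of subword tokens);
# real input must be tokenized with the model's own subword tokenizer.
results = translator.translate_batch([["Hello", "world"]])
print(results[0].hypotheses[0])
```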

Examples of some outputs on the Flores200 benchmark:

sss of of of of of of of of of of of
sss                                                     in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in          in in in in in in   patients patients patients in in in in in in in  in in in in in   in in in in in in in in in in in in in in in in in in in in in in patients in in in in in      in in in in in in in in patients patients patients patients patients patients patients patients patients patients patients patients patients in in patients patients patients patients patients in countries countries in in in in in in in in in             in in in in in in in in in in     in in in in in
ssss
ssmmmmmmmm
ss
__opt_src_en__opt_src_en__opt_src_en
sss
sss                       of of of of of of of of of                         of of
sss                                                tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax                  tax tax tax tax tax tax tax tax tax tax tax tax tax     tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax

sssmmmmmmmmmmmmmmmmmmm
@minhthuc2502
Collaborator

It should work with both old and new versions of OpenNMT-py. I don't have enough information to help you, sorry.

FYI, I will disable the flash attention feature in a future CTranslate2 version because it does not improve inference performance much and makes the package quite a lot heavier.
