
Flash Attention regurgitates repeated tokens - seq2seq #1752

Open
ArtanisTheOne opened this issue Aug 3, 2024 · 1 comment

@ArtanisTheOne

I'm having generation issues with NMT models trained with OpenNMT-py, including models trained with versions that predate flash attention and one I'm currently training with the most recent version, which includes flash attention. The models were converted using onmt_release_model with storage quantization set to int8.
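
For reference, a minimal sketch of the conversion step described above. The original report used onmt_release_model; the standalone CTranslate2 converter CLI shown here performs the equivalent conversion, and the paths and output directory are placeholders:

```bash
# Hypothetical paths; int8 storage quantization as in the report above.
ct2-opennmt-py-converter --model_path model.pt --output_dir ct2_model --quantization int8
```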

It happens when setting flash_attention=True when creating the ctranslate2.Translator object. The GPU is an RTX 3090.

I don't know if this is just an arch issue or something to do with the conversion process from OpenNMT-py.
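
For context, a minimal sketch of how flash attention is enabled at load time, assuming a converted model directory ct2_model and illustrative pre-tokenized input (both are placeholders, not taken from the report verbatim):

```python
import ctranslate2

# flash_attention=True is the option referenced above; the device and
# compute_type values are assumptions matching the int8 conversion and the RTX 3090.
translator = ctranslate2.Translator(
    "ct2_model",
    device="cuda",
    compute_type="int8",
    flash_attention=True,
)

# translate_batch expects pre-tokenized input (lists of subword tokens);
# real input must be tokenized with the model's own subword tokenizer.
results = translator.translate_batch([["Hello", "world"]])
print(results[0].hypotheses[0])
```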

Examples of some outputs on the Flores200 benchmark:

sss of of of of of of of of of of of
sss                                                     in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in          in in in in in in   patients patients patients in in in in in in in  in in in in in   in in in in in in in in in in in in in in in in in in in in in in patients in in in in in      in in in in in in in in patients patients patients patients patients patients patients patients patients patients patients patients patients in in patients patients patients patients patients in countries countries in in in in in in in in in             in in in in in in in in in in     in in in in in
ssss
ssmmmmmmmm
ss
__opt_src_en__opt_src_en__opt_src_en
sss
sss                       of of of of of of of of of                         of of
sss                                                tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax                  tax tax tax tax tax tax tax tax tax tax tax tax tax     tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax

sssmmmmmmmmmmmmmmmmmmm
@minhthuc2502
Collaborator

It should work with both old and new versions of OpenNMT-py. I don't have enough information to help you, sorry.

FYI, I will disable the flash attention feature in a future CTranslate2 version because it does not improve inference performance much and makes the package quite a lot heavier.
