Having some generation issues with NMT models trained with OpenNMT-py. This includes models trained with OpenNMT-py versions from before flash attention existed, as well as one I'm currently training with the most recent version, which includes flash attention. Models were converted using onmt_release_model with storage quantization set to int8.
It happens when setting flash_attention=True while creating the ctranslate2.Translator object. The GPU is an RTX 3090.
I don't know if this is just an arch issue or something to do with the conversion process from OpenNMT-py.
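For reference, here is a minimal sketch of the setup; "ct2_model" and the input tokens are placeholders, and the conversion flags are approximate:

```python
# Sketch of the reproduction, assuming a SentencePiece-tokenized model.
# Conversion was done along these lines (flags approximate):
#
#   onmt_release_model --model model.pt --output ct2_model \
#       --format ctranslate2 --quant int8
#
import ctranslate2

translator = ctranslate2.Translator(
    "ct2_model",           # converted model directory (placeholder path)
    device="cuda",         # RTX 3090
    compute_type="int8",   # matches the int8 storage quantization
    flash_attention=True,  # turning this on triggers the degenerate output
)

# Illustrative tokens; real inputs come from the Flores200 benchmark.
results = translator.translate_batch([["▁Hello", "▁world", "."]])
print(" ".join(results[0].hypotheses[0]))
```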
Examples of some outputs on the Flores200 benchmark:
sss of of of of of of of of of of of
sss in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in patients patients patients in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in patients in in in in in in in in in in in in in patients patients patients patients patients patients patients patients patients patients patients patients patients in in patients patients patients patients patients in countries countries in in in in in in in in in in in in in in in in in in in in in in in in
ssss
ssmmmmmmmm
ss
__opt_src_en__opt_src_en__opt_src_en
sss
sss of of of of of of of of of of of
sss tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax
sssmmmmmmmmmmmmmmmmmmm
It should work with old or new versions of OpenNMT-py. I don't have enough information to help you, sorry.
FYI, I will disable the flash attention feature in a future CTranslate2 version because it does not improve inference performance much and makes the package quite a lot heavier.
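In the meantime, a minimal workaround sketch is to leave flash_attention at its default of False when creating the translator (same converted model assumed, "ct2_model" is a placeholder path):

```python
import ctranslate2

# flash_attention defaults to False; leaving it off avoids the
# degenerate repetition while keeping the int8 model unchanged.
translator = ctranslate2.Translator(
    "ct2_model",
    device="cuda",
    compute_type="int8",
    flash_attention=False,  # explicit here for clarity; False is the default
)
```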