
Release 4.4.0 and flash attention with python [WIP] #1775

Closed
BBC-Esq opened this issue Sep 10, 2024 · 3 comments


BBC-Esq commented Sep 10, 2024

It looks like Flash Attention was removed from the Python portion in release 4.4.0. I have a few questions:

  1. Can you confirm that Flash Attention is still available in release 4.3.1? As far as I'm aware, no benchmarking was done on long-context QA with and without Flash Attention 2, only on relatively short prompts/contexts. I'd still like to benchmark FA on longer contexts to see whether there's a meaningful benefit (a rough sketch of what I have in mind is below).

  2. Is it possible to compile version 4.4.0 with Flash Attention into a wheel file, even though such a wheel won't be uploaded to pypi.org? If my benchmarking indicates it's advantageous, I'd like to use both version 4.4.0's improvements AND Flash Attention. I'm not very familiar with compiling in general, so forgive the question, but if I compile from source, will the build include the relevant Python portions that you say are now omitted?

Thanks again for the great work.
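
For reference, here is roughly the comparison I have in mind. This is only a minimal sketch: it assumes the flash_attention constructor flag in the Python API (the feature this issue is about, so presumably available in 4.3.1), a placeholder model directory, and a synthetic long prompt used purely for timing.

```python
import time

import ctranslate2

# Placeholder values for illustration: a converted CTranslate2 model directory
# and a synthetic ~3k-token prompt (a repeated token is enough for raw timing).
MODEL_DIR = "ct2_model"
PROMPT_TOKENS = ["▁token"] * 3000


def time_generation(flash_attention: bool) -> float:
    # flash_attention is the constructor flag discussed in this thread
    # (assumed available in 4.3.1, removed from the Python build in 4.4.0).
    generator = ctranslate2.Generator(
        MODEL_DIR,
        device="cuda",
        compute_type="float16",
        flash_attention=flash_attention,
    )
    start = time.perf_counter()
    generator.generate_batch(
        [PROMPT_TOKENS],
        max_length=256,
        beam_size=1,
        include_prompt_in_result=False,
    )
    return time.perf_counter() - start


for enabled in (False, True):
    print(f"flash_attention={enabled}: {time_generation(enabled):.2f}s")
```

I'd run each configuration a few times and ignore the first warm-up run before comparing numbers.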

minhthuc2502 (Collaborator) commented:

  1. I conducted some benchmarks with a long context (around 3000 tokens) and did not observe significant improvements. If you can run this benchmark on your side, I'd appreciate it. Release 4.3.1 still supports Flash Attention (there are some improvements in the 4.4.0 release, but not many; testing with 4.3.1 is enough).

  2. If we don't push the wheel file to pypi.org, we will need to establish a new release process similar to the Flash Attention releases. To keep things simple, more work is needed on the Flash Attention feature first; we can reactivate it at a later stage.

BBC-Esq changed the title from "Release 4.4.0 and flash attention with python" to "Release 4.4.0 and flash attention with python [WIP]" on Sep 10, 2024

BBC-Esq commented Sep 10, 2024

Will bench in the near future when I have the time, hence the [WIP] in the title, and let y'all know if my results are different. I previously benchmarked and noticed significant benefits when only the beam_size parameter was changed, but never got around to benching much longer contexts (e.g. 8k/16k), which are starting to become the norm (like 4k was to 2k, etc.). A rough sketch of the sweep I have in mind is below.
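
Just a sketch, with a placeholder model directory, synthetic prompts, and the assumption that the model (and a 4.3.1 build exposing the flash_attention flag) actually supports 8k/16k contexts:

```python
import itertools
import time

import ctranslate2

MODEL_DIR = "ct2_model"  # placeholder: a converted CTranslate2 model directory

# Assumes a build where the flash_attention flag is still exposed (e.g. 4.3.1).
generator = ctranslate2.Generator(
    MODEL_DIR,
    device="cuda",
    compute_type="float16",
    flash_attention=True,
)

# Sweep the context lengths and beam sizes mentioned above; synthetic prompts
# (a repeated token) are fine for comparing raw latency.
for context_len, beam_size in itertools.product((2048, 4096, 8192, 16384), (1, 5)):
    prompt = [["▁token"] * context_len]
    start = time.perf_counter()
    generator.generate_batch(prompt, max_length=128, beam_size=beam_size)
    elapsed = time.perf_counter() - start
    print(f"context={context_len:>5} beam_size={beam_size}: {elapsed:.2f}s")
```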


BBC-Esq commented Oct 15, 2024

UPDATE: I don't have time to bench right now, but I will try my best in the future. Closing for now.

BBC-Esq closed this as completed on Oct 15, 2024