
Release 4.4.0 and flash attention with python [WIP] #1775

Closed
BBC-Esq opened this issue Sep 10, 2024 · 3 comments


BBC-Esq commented Sep 10, 2024

It looks like Flash Attention was removed from the Python portion in release 4.4.0. I have a few questions:

  1. Can you confirm that Flash Attention is still available in release 4.3.1? As far as I'm aware, no benchmarking was done on long-context QA with and without Flash Attention 2, only on relatively short prompts/contexts. I'd still like to benchmark FA on longer contexts to see whether there's a meaningful benefit (a rough sketch of what I have in mind is below).

  2. Is it possible to compile version 4.4.0 with Flash Attention into a wheel file, even though such a wheel won't be uploaded to pypi.org? If my benchmarking indicates it's advantageous, I'd like to use both version 4.4.0's improvements AND Flash Attention. I'm not very familiar with compiling in general, so forgive the question, but if I compile from source, will the build include the relevant Python portions that you say are now omitted?

Thanks again for the great work.
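
For reference, here is roughly the comparison I have in mind. This is only a minimal sketch: it assumes the flash_attention constructor flag in the Python API (the feature this issue is about, so presumably available in 4.3.1), a placeholder model directory, and a synthetic long prompt used purely for timing.

```python
import time

import ctranslate2

# Placeholder values for illustration: a converted CTranslate2 model directory
# and a synthetic ~3k-token prompt (a repeated token is enough for raw timing).
MODEL_DIR = "ct2_model"
PROMPT_TOKENS = ["▁token"] * 3000


def time_generation(flash_attention: bool) -> float:
    # flash_attention is the constructor flag discussed in this thread
    # (assumed available in 4.3.1, removed from the Python build in 4.4.0).
    generator = ctranslate2.Generator(
        MODEL_DIR,
        device="cuda",
        compute_type="float16",
        flash_attention=flash_attention,
    )
    start = time.perf_counter()
    generator.generate_batch(
        [PROMPT_TOKENS],
        max_length=256,
        beam_size=1,
        include_prompt_in_result=False,
    )
    return time.perf_counter() - start


for enabled in (False, True):
    print(f"flash_attention={enabled}: {time_generation(enabled):.2f}s")
```

I'd run each configuration a few times and ignore the first warm-up run before comparing numbers.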

minhthuc2502 (Collaborator) commented:

  1. I conducted some benchmarks with a long context (around 3000 tokens) and did not observe significant improvements. If you can run this benchmark on your side, I'd appreciate it. Release 4.3.1 still supports Flash Attention (there are some improvements in the 4.4.0 release, but not many; testing with 4.3.1 is enough).

  2. If we don't push the wheel file to pypi.org, we will need to establish a new release process similar to the Flash Attention releases. To keep things simple, more work is needed on the Flash Attention feature first; we can reactivate it at a later stage.

BBC-Esq changed the title from "Release 4.4.0 and flash attention with python" to "Release 4.4.0 and flash attention with python [WIP]" on Sep 10, 2024

BBC-Esq commented Sep 10, 2024

Will bench in the near future when I have the time, hence the [WIP] in the title, and let y'all know if my results are different. I previously benchmarked and noticed significant benefits when only the beam_size parameter was changed, but never got around to benching much longer contexts (e.g. 8k/16k), which are starting to become the norm (like 4k was to 2k, etc.). A rough sketch of the sweep I have in mind is below.
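
Just a sketch, with a placeholder model directory, synthetic prompts, and the assumption that the model (and a 4.3.1 build exposing the flash_attention flag) actually supports 8k/16k contexts:

```python
import itertools
import time

import ctranslate2

MODEL_DIR = "ct2_model"  # placeholder: a converted CTranslate2 model directory

# Assumes a build where the flash_attention flag is still exposed (e.g. 4.3.1).
generator = ctranslate2.Generator(
    MODEL_DIR,
    device="cuda",
    compute_type="float16",
    flash_attention=True,
)

# Sweep the context lengths and beam sizes mentioned above; synthetic prompts
# (a repeated token) are fine for comparing raw latency.
for context_len, beam_size in itertools.product((2048, 4096, 8192, 16384), (1, 5)):
    prompt = [["▁token"] * context_len]
    start = time.perf_counter()
    generator.generate_batch(prompt, max_length=128, beam_size=beam_size)
    elapsed = time.perf_counter() - start
    print(f"context={context_len:>5} beam_size={beam_size}: {elapsed:.2f}s")
```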


BBC-Esq commented Oct 15, 2024

UPDATE: I don't have time to bench right now, but I will try my best in the future. Closing for now.

BBC-Esq closed this as completed on Oct 15, 2024