
Add flash_attn support #306

Merged: 6 commits into michaelfeil:flash-attn-docker on Jul 21, 2024

Conversation

gokturkDev
Contributor

Related to #304 #17

After a lot of digging, the Dockerfile setup below allows the use of flash_attn.
The only long-term maintenance concern I see is package compatibility with infinity.
The base image ships Python 3.10.12 with torch 2.4.1. flash_attn is extremely picky about its dependencies, so I am not sure whether other versions would work, but as long as infinity keeps supporting torch>=2.2 and python>=3.9 I think it's fine.

Main Points

  • Uses nvcr.io/nvidia/pytorch:24.06-py3
    • Images pulled directly from the NVIDIA Docker Hub have a compatibility issue that I haven't been able to work around. As mentioned here, flash_attn hardcodes the CUDA version when building its wheels, which is probably the cause. Using the image from nvcr.io, as the author of flash_attn suggests, solved the issue.
  • Bypasses the poetry install
  • Hardcodes Python 3.10
    • This isn't a must, but the base image ships Python 3.10.12, so I rolled with it to avoid trouble with the base installations. Building for Python 3.11 could be worth experimenting with. A rough sketch of the overall setup is shown below.
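
For reference, a minimal sketch of this kind of setup. It is not the exact Dockerfile from this PR; the copied paths, the extras name, and the entrypoint are illustrative assumptions.

```dockerfile
# Sketch only, not the Dockerfile merged in this PR.
# The base image ships Python 3.10.12, torch 2.4.1, and a matching CUDA toolchain.
FROM nvcr.io/nvidia/pytorch:24.06-py3

WORKDIR /app
COPY . /app

# Bypass poetry: install infinity with pip so the torch/CUDA stack of the
# base image is reused instead of being re-resolved.
# (The path and extras name below are illustrative.)
RUN pip install --no-cache-dir "./libs/infinity_emb[all]"

# flash-attn compiles against the CUDA version already present in the image,
# which avoids the CUDA-version mismatch of the prebuilt wheels.
RUN pip install --no-cache-dir flash-attn --no-build-isolation

# Illustrative entrypoint; model name and engine are meant to be parametrized.
ENTRYPOINT ["infinity_emb"]
```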

@gokturkDev gokturkDev marked this pull request as ready for review July 18, 2024 14:44
Owner

@michaelfeil michaelfeil left a comment


As mentioned, I won't approve flash-attn support.

@gokturkDev
Contributor Author

gokturkDev commented Jul 18, 2024

Why not? This is not intended to replace the existing build procedure, just an option for those who need it. My company can afford the extra dollars for the GPUs/CPUs that support flash attention etc. in order to run the latest models. I believe a lot of other startups are in the same position.

@michaelfeil
Owner

michaelfeil commented Jul 18, 2024

Understood, but it might be challenging to maintain.
The docker image is not backed by unit tests.

There is a TensorRT stage in the main Dockerfile. Are you able to contribute this in a similar way?

Not sure if you are aware, but I am actively using flash-attention-2 via torch: https://pytorch.org/blog/pytorch2-2/
For the specific Qwen models, there is in theory no performance-based need for the flash-attn package. Requiring it anyway is a detail the authors missed.
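
To illustrate the torch-native path, a minimal sketch (not infinity's internal code; the shapes, dtypes, and the torch >= 2.3 API used here are assumptions):

```python
# Sketch: flash-attention-2 through torch's built-in SDPA, no flash-attn package.
# Assumes torch >= 2.3, a CUDA GPU, and fp16/bf16 inputs (required by the flash kernel).
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict SDPA to the flash-attention backend; torch raises an error if the
# inputs are not supported by that kernel instead of silently falling back.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```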

@gokturkDev
Contributor Author

gokturkDev commented Jul 18, 2024

There is a TensorRT stage in the main Dockerfile. Are you able to contribute this in a similar way?

Can do, will get that done by end of week

Not sure if you are aware, but I am actively using flash-attention-2 via torch: https://pytorch.org/blog/pytorch2-2/
For the specific Qwen models, there is in theory no performance-based need for the flash-attn package. Requiring it anyway is a detail the authors missed.

I am not active in the pytorch/sentence-transformers scene, so I am not sure I'm the right person to open an issue. However, if you open one, I would be delighted to show all the horrible things going on in the flash_attn repo right now 😂

@michaelfeil
Owner

There is a TensorRT stage in the main Dockerfile. Are you able to contribute this in a similar way?

Can do, will get that done by end of week

Sounds exciting. I think the flash-attn package should be installable with pip in the main Docker image. An extra stage should be minimal maintenance.
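
For context, an extra stage along those lines could look roughly like this (a hedged sketch; the `production` stage name is an assumption, not the actual stage name in infinity's Dockerfile):

```dockerfile
# Sketch only: a possible opt-in stage appended to the existing multi-stage Dockerfile.
# "production" stands in for whatever the final stage is actually called.
FROM production AS production-with-flash-attn

# --no-build-isolation lets flash-attn build against the torch/CUDA
# already installed in the image instead of pulling its own.
RUN pip install --no-cache-dir flash-attn --no-build-isolation
```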

@gokturkDev
Contributor Author

Unfortunately I have to delay this to next week. We are very close to launching our beta at my company, so I am absolutely swamped right now, and I want to test this properly.

@codecov-commenter

codecov-commenter commented Jul 21, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.69%. Comparing base (d48aa67) to head (2e5bf6e).
Report is 2 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #306   +/-   ##
=======================================
  Coverage   77.69%   77.69%           
=======================================
  Files          35       35           
  Lines        2511     2511           
=======================================
  Hits         1951     1951           
  Misses        560      560           


@michaelfeil michaelfeil changed the base branch from main to flash-attn-docker July 21, 2024 21:50
@michaelfeil michaelfeil merged commit fddd955 into michaelfeil:flash-attn-docker Jul 21, 2024
michaelfeil added a commit that referenced this pull request Jul 21, 2024
* Add flash_attn support  (#306)

* add dockerfile for flash_attn setup

* remove test.py

* parametrize model name and engine

* Update Dockerfile

---------

Co-authored-by: Michael Feil <[email protected]>

* Delete libs/infinity_emb/Dockerfile.flash

---------

Co-authored-by: Göktürk <[email protected]>