
Add flash_attn support #306

Merged: 6 commits into michaelfeil:flash-attn-docker on Jul 21, 2024

Conversation

gokturkDev
Contributor

Related to #304 #17

After a lot of digging, the Dockerfile setup below allows the use of flash_attn.
The only long-term maintenance concern I see is package compatibility with infinity.
The base image ships Python 3.10.12 with torch 2.4.1. flash_attn is extremely picky about its dependencies, so I am not sure whether other versions would work, but as long as infinity keeps supporting torch>=2.2 and python>=3.9 I think it's fine.

Main Points

  • Uses nvcr.io/nvidia/pytorch:24.06-py3
    • Images pulled directly from the NVIDIA Docker Hub have a compatibility issue that I haven't been able to work around. As mentioned here, flash_attn hardcodes the CUDA version when building its wheels, which is probably the cause. Using the image from nvcr.io, as the author of flash_attn suggests, solved the issue.
  • Bypasses the poetry install
  • Hardcodes Python 3.10
    • This isn't a must, but the base image ships Python 3.10.12, so I rolled with it to avoid trouble with the base installations. Building for Python 3.11 could be worth experimenting with. A rough sketch of the overall setup is shown below.
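
For reference, a minimal sketch of this kind of setup. It is not the exact Dockerfile from this PR; the copied paths, the extras name, and the entrypoint are illustrative assumptions.

```dockerfile
# Sketch only, not the Dockerfile merged in this PR.
# The base image ships Python 3.10.12, torch 2.4.1, and a matching CUDA toolchain.
FROM nvcr.io/nvidia/pytorch:24.06-py3

WORKDIR /app
COPY . /app

# Bypass poetry: install infinity with pip so the torch/CUDA stack of the
# base image is reused instead of being re-resolved.
# (The path and extras name below are illustrative.)
RUN pip install --no-cache-dir "./libs/infinity_emb[all]"

# flash-attn compiles against the CUDA version already present in the image,
# which avoids the CUDA-version mismatch of the prebuilt wheels.
RUN pip install --no-cache-dir flash-attn --no-build-isolation

# Illustrative entrypoint; model name and engine are meant to be parametrized.
ENTRYPOINT ["infinity_emb"]
```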

@gokturkDev gokturkDev marked this pull request as ready for review July 18, 2024 14:44
Owner

@michaelfeil michaelfeil left a comment


As mentioned, I won't approve flash-attn support.

@gokturkDev
Contributor Author

gokturkDev commented Jul 18, 2024

Why not? This is not intended to replace the existing build procedure, just an option for those who need it. My company can afford the extra dollars for the GPUs/CPUs that support flash attention etc. in order to run the latest models. I believe a lot of other startups are in the same position.

@michaelfeil
Owner

michaelfeil commented Jul 18, 2024

Understood, but it might be challenging to maintain.
The docker image is not backed by unit tests.

There is a TensorRT stage in the main Dockerfile. Are you able to contribute this in a similar way?

Not sure if you are aware, but I am actively using flash-attention-2 via torch: https://pytorch.org/blog/pytorch2-2/
For the specific Qwen models, there is in theory no performance-based need for the flash-attn package. Requiring it anyway is a detail the authors missed.
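
To illustrate the torch-native path, a minimal sketch (not infinity's internal code; the shapes, dtypes, and the torch >= 2.3 API used here are assumptions):

```python
# Sketch: flash-attention-2 through torch's built-in SDPA, no flash-attn package.
# Assumes torch >= 2.3, a CUDA GPU, and fp16/bf16 inputs (required by the flash kernel).
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict SDPA to the flash-attention backend; torch raises an error if the
# inputs are not supported by that kernel instead of silently falling back.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```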

@gokturkDev
Contributor Author

gokturkDev commented Jul 18, 2024

There is a TensorRT stage in the main Dockerfile. Are you able to contribute this in a similar way?

Can do, will get that done by end of week

Not sure if you are aware, but I am actively using flash-attention-2 via torch: https://pytorch.org/blog/pytorch2-2/
For the specific Qwen models, there is in theory no performance-based need for the flash-attn package. Requiring it anyway is a detail the authors missed.

I am not active in the pytorch/sentence-transformers scene, so I am not sure I'm the right person to open an issue. However, if you open one, I would be delighted to show all the horrible things going on in the flash_attn repo right now 😂

@michaelfeil
Owner

There is a TensorRT stage in the main Dockerfile. Are you able to contribute this in a similar way?

Can do, will get that done by end of week

Sounds exciting. I think the flash-attn package should be installable with pip in the main Docker image. An extra stage should be minimal maintenance.
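
For context, an extra stage along those lines could look roughly like this (a hedged sketch; the `production` stage name is an assumption, not the actual stage name in infinity's Dockerfile):

```dockerfile
# Sketch only: a possible opt-in stage appended to the existing multi-stage Dockerfile.
# "production" stands in for whatever the final stage is actually called.
FROM production AS production-with-flash-attn

# --no-build-isolation lets flash-attn build against the torch/CUDA
# already installed in the image instead of pulling its own.
RUN pip install --no-cache-dir flash-attn --no-build-isolation
```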

@gokturkDev
Contributor Author

Unfortunately I have to delay this to next week. We are very close to launching our beta at my company, so I am absolutely swamped right now, and I want to test this properly.

@codecov-commenter

codecov-commenter commented Jul 21, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.69%. Comparing base (d48aa67) to head (2e5bf6e).
Report is 2 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #306   +/-   ##
=======================================
  Coverage   77.69%   77.69%           
=======================================
  Files          35       35           
  Lines        2511     2511           
=======================================
  Hits         1951     1951           
  Misses        560      560           


@michaelfeil michaelfeil changed the base branch from main to flash-attn-docker July 21, 2024 21:50
@michaelfeil michaelfeil merged commit fddd955 into michaelfeil:flash-attn-docker Jul 21, 2024
michaelfeil added a commit that referenced this pull request Jul 21, 2024
* Add flash_attn support  (#306)

* add dockerfile for flash_attn setup

* remove test.py

* parametrize model name and engine

* Update Dockerfile

---------

Co-authored-by: Michael Feil <[email protected]>

* Delete libs/infinity_emb/Dockerfile.flash

---------

Co-authored-by: Göktürk <[email protected]>