Add FlashAttention #24

Merged
epwalsh merged 25 commits into main from flash-attn on Mar 8, 2023

Add FlashAttention #24

merged 25 commits into from
Mar 8, 2023

Conversation

epwalsh (Member) commented on Mar 8, 2023

Closes #23

⚠️ This is blocked right now on Dao-AILab/flash-attention#132 (update: I got this working with an older version of Triton; we'll keep tracking that issue and update Triton when it's fixed - see #26).

Uses the official FlashAttention implementation. I've prebuilt a Python 3.10, PyTorch 1.13.1, CUDA 11.7 wheel for this, which you can install with:

pip install https://storage.googleapis.com/ai2-python-wheels/flash_attn/flash_attn-0.2.8%2Bcu117torch1.13.1-cp310-cp310-linux_x86_64.whl

This will be built into the Docker images.

For now we'll have to use their Triton version, since the CUDA version doesn't support arbitrary attention biases, meaning we can't use ALiBi with it.

The advantages and disadvantages of the Triton implementation are discussed here:

https://github.com/HazyResearch/flash-attention/blob/57ee618170e1adecbf787365cdf330c63768abd2/flash_attn/flash_attn_triton.py#L1-L35
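
To make the ALiBi point concrete, here is a rough sketch (not code from this PR) of building an ALiBi bias and passing it to the Triton kernel. It assumes the flash_attn 0.2.8 interface, where `flash_attn.flash_attn_triton.flash_attn_func` takes `(q, k, v, bias, causal, softmax_scale)` with `q`/`k`/`v` shaped `(batch, seqlen, nheads, headdim)`; check the file linked above for the exact contract.

```python
# Hedged sketch: an ALiBi bias fed to the Triton FlashAttention kernel.
# Assumes flash_attn==0.2.8, where flash_attn_func accepts a bias broadcastable to
# (batch, nheads, seqlen, seqlen) and q/k/v are fp16/bf16 CUDA tensors of shape
# (batch, seqlen, nheads, headdim). Verify against the linked source before relying on it.
import torch
from flash_attn.flash_attn_triton import flash_attn_func


def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Standard ALiBi slopes for a power-of-two number of heads: 2^(-8 * i / n_heads).
    return torch.tensor([2 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])


def alibi_bias(n_heads: int, seq_len: int, device, dtype) -> torch.Tensor:
    pos = torch.arange(seq_len, device=device)
    rel = pos[None, :] - pos[:, None]             # (seq_len, seq_len), j - i
    slopes = alibi_slopes(n_heads).to(device)     # (n_heads,)
    # (1, n_heads, seq_len, seq_len): under a causal mask, rel <= 0, so distant
    # positions get an increasingly negative bias, which is exactly ALiBi.
    return (slopes[:, None, None] * rel[None]).to(dtype).unsqueeze(0)


batch, seq_len, n_heads, head_dim = 2, 1024, 8, 64
q, k, v = (
    torch.randn(batch, seq_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
    for _ in range(3)
)
bias = alibi_bias(n_heads, seq_len, q.device, q.dtype)
out = flash_attn_func(q, k, v, bias, True)  # causal=True; out: (batch, seqlen, nheads, headdim)
```

The CUDA kernels in this release have no equivalent bias argument, which is why ALiBi forces us onto the Triton path for now.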

And to add to that, one of our contacts at MosaicML says:

In my experience, IF you are not using ALL the GPU's memory, the Triton version is almost always slightly faster than the CUDA FlashAttn implementation.
But it's much slower when you are close to using all the memory (which happens with larger models).
I've filed an issue in the Triton repo.
There I note that, in the FSDP config, setting limit_all_gathers to True enables running the model with the Triton attn implementation at larger scale.
As noted there, on 128 GPUs I can run the 30B model using Triton (which supports bias) and it's nearly as fast as using the CUDA version.
Final note: FlashAttn has a rewrite (slated for April) and they plan to support attn bias in May.
With the current implementation there is this PR for supporting bias. I tried building from that branch and running, but the kernel was much slower in that branch...
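
For reference, the limit_all_gathers knob mentioned above is an argument on PyTorch's FSDP wrapper. Here is a minimal sketch of turning it on; the wrapped module and the other FSDP arguments are placeholders, not this repo's actual training config, and it assumes a PyTorch version whose FSDP constructor accepts limit_all_gathers.

```python
# Hedged sketch of the FSDP setting from the note above. The model below is a stand-in,
# not our actual architecture; assumes torchrun-style env vars (RANK, WORLD_SIZE, etc.).
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()  # placeholder module

fsdp_model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    # Rate-limit how far ahead the CPU thread issues parameter all-gathers, so prefetched
    # shards don't pile up in GPU memory; per the quoted note, this is what made the
    # Triton attention kernel viable at larger scale.
    limit_all_gathers=True,
)
```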


This PR is based on:

epwalsh changed the title from "Get flash attention working" to "Add FlashAttention" on Mar 8, 2023
epwalsh marked this pull request as ready for review on Mar 8, 2023 at 22:55
epwalsh merged commit 5222c35 into main on Mar 8, 2023
epwalsh deleted the flash-attn branch on Mar 8, 2023 at 23:56