[WIP] First draft for softcapping. #1025

Merged · 2 commits merged into Dao-AILab:main on Jul 8, 2024

Conversation

Narsil
Contributor

Narsil commented Jul 3, 2024

Quick&dirty implementation (but seems functional).

Fixes #1016

@Narsil Narsil marked this pull request as draft July 3, 2024 19:40
@Narsil Narsil mentioned this pull request Jul 3, 2024
@tridao
Contributor

tridao commented Jul 3, 2024

You can add these to the setup.py to reduce compilation time:

                        "-DFLASHATTENTION_DISABLE_BACKWARD",
                        "-DFLASHATTENTION_DISABLE_DROPOUT",
                        "-DFLASHATTENTION_DISABLE_ALIBI",
                        "-DFLASHATTENTION_DISABLE_UNEVEN_K",
                        "-DFLASHATTENTION_DISABLE_LOCAL",

@Narsil
Contributor Author

Narsil commented Jul 3, 2024

Oh nice, missed those!

@tridao
Contributor

tridao commented Jul 3, 2024

I think softcapping should be done before the masking. i.e. the sequence is gemm, softcapping, masking, then softmax.
If you do softcapping after masking, then some masked tokens will contribute a tiny amount to the softmax. In practice it's probably ok if the softcap value is large (like 50), but if it's small (e.g. 1.0), this can lead to information leakage from future tokens to past tokens.
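
To make the ordering concrete, here is a small PyTorch reference (an illustration only; the function name and shapes are assumptions, not kernel code):

import torch

def softcapped_attention_scores(q, k, mask, softmax_scale, softcap):
    """Reference only; mask is True where attention is disallowed."""
    s = (q @ k.transpose(-2, -1)) * softmax_scale   # gemm (+ scale)
    s = softcap * torch.tanh(s / softcap)           # softcapping
    s = s.masked_fill(mask, float("-inf"))          # masking
    return torch.softmax(s, dim=-1)                 # softmax: masked positions get exactly 0

# With the order reversed (mask, then softcap), tanh(-inf) == -1, so a masked score
# becomes -softcap instead of -inf and still receives a small nonzero softmax weight.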

@tridao
Contributor

tridao commented Jul 3, 2024

softcapping can be fused with the multiplication by softmax_scale.
i.e. we do S = gemm(Q, K), then S *= softmax_scale / softcap, where "softmax_scale / softcap" is a constant we can compute beforehand and put in the params.
Then we do masking, then take the max.
When it's time to do exp, we do exp(scores * softcap - max * softcap). This will use a fused multiply-add, so it's just 1 instruction.

@Narsil
Contributor Author

Narsil commented Jul 4, 2024

There is still the tanh that's missing somewhere; where would you put it?
I was thinking of adding a const template parameter for softcapping so the cost of the branch wouldn't affect non-softcapped kernels.

wdyt?

@tridao
Contributor

tridao commented Jul 4, 2024

Sorry I missed the tanh.
The steps should be: S = gemm(Q, K), then S = tanh(S * softmax_scale / softcap), then masking, taking the max, then exp2f(scores * softcap * log2(e) - max * softcap * log2(e)).
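
A quick numerical sanity check of this recipe (my own PyTorch sketch, not the kernel code), comparing the fused formulation against a naive softmax of softcap * tanh(S * softmax_scale / softcap):

import math
import torch

softmax_scale, softcap = 1 / math.sqrt(128), 50.0
raw = torch.randn(4, 8) * 100          # stand-in for the raw gemm output

# Naive: softmax(softcap * tanh(raw * softmax_scale / softcap))
ref = torch.softmax(softcap * torch.tanh(raw * softmax_scale / softcap), dim=-1)

# Fused-style: fold softmax_scale / softcap into the tanh argument once,
# then carry softcap * log2(e) into the exp2 step (one FMA per element in the kernel).
s = torch.tanh(raw * (softmax_scale / softcap))
c = softcap * math.log2(math.e)
m = s.max(dim=-1, keepdim=True).values
p = torch.exp2(s * c - m * c)
fused = p / p.sum(dim=-1, keepdim=True)

assert torch.allclose(ref, fused, atol=1e-5)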

Yeah we should template to avoid slowing down the usual attention.

@Narsil
Contributor Author

Narsil commented Jul 4, 2024

Ok I put in the template for Is_softcapping; however, I cannot get your idea working. I may be missing something.

My understanding is that

tanh(x * softmax_scale * 1 / softcap)

happens right where I was already doing it (after the gemm, before masking, since tanh would map -inf to -softcap, e.g. -50).
So I added a new op there:
https://github.com/Dao-AILab/flash-attention/pull/1025/files#diff-9e1775131ae22a74dc4e0333c57539573394a059db8097ef74fae24243347ce1R330-R332

// QK^T gemm producing the raw attention scores acc_s.
flash::gemm</*A_in_regs=*/Kernel_traits::Is_Q_in_regs>(
    acc_s, tSrQ, tSrK, tSsQ, tSsK, tiled_mma, smem_tiled_copy_Q, smem_tiled_copy_K,
    smem_thr_copy_Q, smem_thr_copy_K
);
// if (cute::thread0()) { print(acc_s); }
// Apply tanh-based softcapping right after the gemm, before masking.
if constexpr (Is_softcapping) {
    cute::transform(acc_s, softcapping_op{params.softcapping_scale});
}

And then I updated scale_softmax and scale_softmax_log2 to the equivalent values, but with the softcap value instead of the regular softmax_scale (since it's effectively taking its place).

https://github.com/Dao-AILab/flash-attention/pull/1025/files#diff-406036c9702cf749b9e58833b342cfeb66a40c0faa1b43e2e8610f43c1332a5bR104-R118

if (softcapping.has_value()){
    params.is_softcapping = true;
    // Constant folded into the tanh argument: softmax_scale / softcap.
    params.softcapping_scale = softmax_scale / softcapping.value();
    // The softcap value takes over softmax_scale's role in the exp2f step.
    params.scale_softmax_log2 = softcapping.value() * M_LOG2E;
    params.scale_softmax = softcapping.value();
} else {
    params.is_softcapping = false;
    params.softcapping_scale = 1.0;
    params.scale_softmax_log2 = softmax_scale * M_LOG2E;
    params.scale_softmax = softmax_scale;
}

@Narsil
Contributor Author

Narsil commented Jul 4, 2024

My understanding is that scale_softmax_log2 is only ever used in the partial exponentiation (the log2 is there so we can use exp2f, which I assume has better intrinsics than expf, or is it more about numerical stability?).

scale_softmax is only used in the renormalization, which as I understand it is used to return the softmax to users who ask for it.

There is never a partial sum in the gemm, right? (Meaning the gemm being split along the hidden_dim rank, so the tanh wouldn't capture the entire sum. I don't think it is, but I'm looking for culprits.)

Do you have anything better to suggest than printf debugging for this?
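
As an aside on the exp2f question (my understanding, not from this PR): the rewrite is exp(x) = exp2(x * log2(e)), and folding log2(e) into the precomputed scale lets the kernel call the cheaper exp2f while still fusing the scale into a multiply-add; the numerical stability comes from subtracting the running max, not from the base change. For instance:

import math

x = 3.7
# exp(x) == exp2(x * log2(e)); the base change is about using the fast exp2 path,
# while subtracting the max is what keeps the softmax numerically stable.
assert math.isclose(math.exp(x), 2.0 ** (x * math.log2(math.e)))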

@lucidrains
Contributor

@Narsil oh yea, your code looks way better than what i have lmao, let's just go with your changes

so eventually you'll have to expose a customizable softcapping_scale, which you should default to 30. (iirc used by Grok and Gemma2). maybe researchers will find a better value in the future

@lucidrains
Contributor

lucidrains commented Jul 4, 2024

@Narsil backwards pass looks something like this in naive pytorch - and done another way here. both examples are functional

i understand things are 10-20x harder translating it to CUDA
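
As a rough stand-in for those linked examples, a hedged sketch of just the softcapping piece (not the full attention backward): since the capped score is softcap * tanh(s / softcap), its derivative is 1 - tanh(s / softcap)**2, and that factor multiplies the incoming score gradient.

import torch

def softcap_fn(s, softcap):
    return softcap * torch.tanh(s / softcap)

softcap = 50.0
s = torch.randn(4, 8, requires_grad=True)
g = torch.randn(4, 8)                    # upstream gradient w.r.t. the capped scores

softcap_fn(s, softcap).backward(g)

# Manual gradient: d/ds [softcap * tanh(s / softcap)] = 1 - tanh(s / softcap)**2
manual = g * (1 - torch.tanh(s / softcap) ** 2)
assert torch.allclose(s.grad, manual, atol=1e-6)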

@lucidrains
Contributor

@Narsil great job getting the ball rolling 🙏

@Narsil
Contributor Author

Narsil commented Jul 4, 2024

Thanks, but this is not functional at all. (I think using the DISABLE_ flags breaks flash already.)

@lucidrains
Contributor

@Narsil i think you are really close!

@lucidrains
Contributor

lucidrains commented Jul 4, 2024

@Narsil yea, it is really hard to contribute with these compilation times. how fast were you able to get the times down to? i'm still waiting up to 10 minutes per change

edit: it is 26 minutes, just timed it, let me work on cutting that down rest of the day, or this is impossible to tinker with

edit 2: got it down to 2 minutes! 😄

edit 3: think i got the forward working! a lot of it was just following your lead with cute::transform, so i think you may just be missing something small

edit 4: i'll just move onto the backwards pass, since i think you have it in the bag with the forwards.

@Narsil
Contributor Author

Narsil commented Jul 5, 2024

@lucidrains Can't see your changes anywhere, are you on a branch somewhere?

I got compilation times down to 1 min, but the results seem wrong with them (meaning the regular non-softcapped flash fails).
I am starting a branch from scratch and bisecting code removal to keep at least one test working and compilation times low.

To reduce compilation times I uncomment the compilation flags in setup.py (-DXXXX). Each flag divides the time in half (since they are all booleans afaik). LOCAL and UNEVEN_K I keep (not entirely sure exactly why, but they seem necessary for the test I'm keeping around).

Then I also limit the hidden size to only one value, meaning I comment out the associated kernels in setup.py and then update things a bit everywhere to cut those while keeping the symbols in the binary (undefined-symbol fun).
Basically it needs the changes made here (launch_template + static_switch at least).

@lucidrains
Contributor

@Narsil nice, i got it to around the same ballpark!

unfortunately i was working off a runpod that went down before i could push the changes 😞. but the good news is that i was luckily able to get the backwards pass working as well 🥳

i compared some of the changes for the forward hoping to catch what was wrong in your diff, but couldn't find a difference. in fact i looked at your changes first and used the way you did the cute::transform, so it must be something small on your end

@Narsil
Contributor Author

Narsil commented Jul 5, 2024

Nice work on the backward!
You're saying this branch is supposed to work?

Can I try your branch? (My local changes were exactly this branch.)
A10G (sm_86)
CUDA 12.5
Ubuntu 20.04
Here. I've also dealt a few times with my path being polluted (so two versions of flash coexisted until I manually cleared them).

@lucidrains
Contributor

@Narsil lost my changes, but will work on restoring it later today (traveling with dog for the 4th, American holiday)

yes, i think you are probably just off by some scale, error could even be in your tests

@lucidrains
Contributor

lucidrains commented Jul 5, 2024

i also want to do a separate PR just to make it easy for contributors to get started (specify a few hyperparameters, and enable/disable booleans for those flags, and only those kernels get compiled and tested)

that ended up being the hardest part of all this

@Shreya-Pathak

Shreya-Pathak commented Jul 5, 2024

Hello, I was separately working on adding softcapping to FA for Gemma 2. But seeing this PR, I'll add my WIP repo here if needed for double-checking: https://github.com/Shreya-Pathak/flash-attention. It uses the same idea as discussed above. The forward seemed correct (read: passing tests) when I was checking, but let me know if you see any issues.
Additionally, I'm eager to get softcapping supported in FA as soon as possible, so please let me know how I can contribute. If any support is needed on the backward pass, I will be happy to spend some time on it.

@lucidrains
Contributor

lucidrains commented Jul 5, 2024

@Shreya-Pathak looks great Shreya! i actually prefer your way for the interface, with a boolean flag to turn it on and a default softcapping scale (realistically the public won't know the right value, so it should just default to something proven and working). But Nicolas' way with the cute::transform on the cuda end seems cleaner

@lucidrains
Contributor

lucidrains commented Jul 5, 2024

regardless, i think the forward is basically done, and Gemma2 inference will be viable soon

i'll PR in the backwards pass once one of you gets your changes in. there are actually two ways to approach the backwards, and i'm not sure which is the right one, so i may need to consult Tri

@Narsil
Contributor Author

Narsil commented Jul 5, 2024

Hi @Shreya-Pathak,

I looked at your changes.

Personally I prefer the single flag (fewer things for users to know, and softcapping is unlikely to have a good default; Gemma2 uses 50, not 30, for instance: https://huggingface.co/google/gemma-2-9b-it/blob/main/config.json#L7).
Taste and preference, I guess; ultimately it's a core maintainer's call, I think.

What you did was the first iteration I did, and Tri mentioned that we could do smarter things here: #1025 (comment)

Basically, FA currently applies the scale during the exponentiation of the softmax (to exploit intrinsics that can fuse both ops in a single instruction). Therefore we should be able to get away with using softmax_scale / softcap directly inside the tanh, and only replacing the current softmax_scale with softcap, which will get multiplied as usual (keeping the instruction fused).

Doesn't seem to be working for me, though.

@lucidrains
Contributor

lucidrains commented Jul 5, 2024

@Narsil ah, good to know that Gemma 2 used 50., i believe Grok used 30.. and yea, it is a matter of preference (edit: actually i may change my mind on this; since they are using different values, it's probably better to leave it undefaulted)

so when you say things don't work, is it the changes you made to account for Tri's suggestion? did your first iteration work?

@lucidrains
Contributor

@Narsil ok, i'm sure you'll figure it out. @Shreya-Pathak can probably help too

ping me when you want me to throw up the backwards pass

@Shreya-Pathak

@Narsil I think I have also done what Tri mentioned with the softcap / softmax scale, and from a brief look at your code you seem to be doing the same as well. Could you give more details about which tests are failing and what the differences are?
Regarding the default value, you can still change the softcap value in the calling function AFAICT, but the difference is only in style, so anything is fine by me.

@tridao
Contributor

tridao commented Jul 5, 2024

I think we should pass to tanh softcapping_scale = softmax_scale (e.g. 1/sqrt(headdim)) / softcap_val (e.g. 50.0).
As an example, with headdim = 128 and softcap = 50, we would do tanh(acc_s * 1/sqrt(128) / 50) = tanh(acc_s * 1.77e-3).

Then in the softmax, instead of passing in scale_softmax_log2, we should pass in softcap_val * log2(e).
In this example we would have 50.0 * log2(e) = 72.1.

Overall we're doing exp2(log2(e) * 50 * tanh(acc_s * softmax_scale / 50)).
But hopefully, by multiplying these constants together beforehand, we can reduce the number of instructions.
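
A quick check of these constants and of the overall expression (my own arithmetic sketch, not part of the PR):

import math

softmax_scale = 1 / math.sqrt(128)
softcap = 50.0

print(softmax_scale / softcap)        # ~1.77e-3, the constant folded into the tanh
print(softcap * math.log2(math.e))    # ~72.13, the value that replaces scale_softmax_log2

# exp2(log2(e) * 50 * tanh(x * softmax_scale / 50)) == exp(50 * tanh(x * softmax_scale / 50))
x = 123.4
lhs = 2.0 ** (softcap * math.log2(math.e) * math.tanh(x * softmax_scale / softcap))
rhs = math.exp(softcap * math.tanh(x * softmax_scale / softcap))
assert math.isclose(lhs, rhs)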

@Narsil
Contributor Author

Narsil commented Jul 6, 2024

OK it's updated and now working.

I have no idea why, but cute::transform seems to be the culprit.
Thanks @Shreya-Pathak for the apply_softcap function, which I did use in the end.

@lucidrains
Contributor

> OK it's updated and now working.
>
> I have no idea why but cute::transform seems to be the culprit. Thanks @Shreya-Pathak for the apply_softcap function that I did use in the end.

that's strange! i used cute::transform in both fwd and bwd

ah no matter, all roads lead to rome

@Narsil
Contributor Author

Narsil commented Jul 6, 2024

Do you want to patch this PR for the backward?

@lucidrains
Contributor

lucidrains commented Jul 6, 2024

@Narsil let's land this PR first for the forwards, as I may do two separate PRs for backwards and let Tri pick the one that makes sense

@Narsil Narsil marked this pull request as ready for review July 8, 2024 07:25
@Narsil
Contributor Author

Narsil commented Jul 8, 2024

@tridao do you know what's missing to merge this?

Should I maybe run the CI manually (without backward, since it's not implemented yet)?

@lucidrains
Contributor

@Narsil for backwards, just assert out or throw an error if soft capping is turned on

otherwise great job! @Shreya-Pathak too!

@iamsaurabhgupt

great stuff @Narsil .
waiting for @tridao to merge.

@tridao tridao merged commit 8f873cc into Dao-AILab:main Jul 8, 2024
@iamsaurabhgupt

thanks a lot @tridao

@iamsaurabhgupt

The released build wheels are still dated May 27.
I think there must be some build pipeline that takes time to update release assets?

@@ -639,6 +659,7 @@ def backward(ctx, dout, *args):
ctx.softmax_scale,
ctx.causal,
ctx.window_size,
ctx.softcap

Small typo here, missing a comma

@@ -556,6 +572,7 @@ def backward(ctx, dout, *args):
ctx.softmax_scale,
ctx.causal,
ctx.window_size,
ctx.softcap

Small typo here, missing a comma

Same for me. There is a second missing comma further down. Also, my first compile produced gibberish; am trying again with updated cutlass. It has successfully compiled from the same folder before.

@turboderp

It compiled, though (finally), and after fixing the typos above everything appears to be working. Didn't test bwd, but inference is correct on Gemma2 models and there's no noticeable overhead. Thanks for this. 🥇

@Oxi84

Oxi84 commented Jul 12, 2024

I installed via pip, version flash-attn 2.6.0.post1.

It still does not fix the problem.

Is this fix included in the pip release, or is there another way to install it?

@foreverlms

I'd be glad to have this question answered:
If Is_softcap is False, where is the scaling of QK^T performed?

(Referring to this line of code.)
