Separate attention backends #3005

WoosukKwon · 2024-02-23T07:02:37Z

This PR refactors the attention layer. Specifically, it separates the code paths for Ampere or more recent NVIDIA GPUs (which can directly use FlashAttention) and other GPUs, so that the code for the former becomes much simpler. This PR will also bring some performance improvements for ALiBi models, since we now directly call FlashAttention instead of using xformers in the middle.
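As a rough illustration of the split described above, backend selection could look like the sketch below. The function name `select_attention_backend`, the returned strings, and the `is_nvidia` and `flash_attn_installed` flags are hypothetical and not taken from the PR. The only fact relied on is that Ampere corresponds to CUDA compute capability 8.0.

```python
from typing import Tuple

# Ampere GPUs report CUDA compute capability 8.0; newer architectures report more.
AMPERE_CAPABILITY: Tuple[int, int] = (8, 0)


def select_attention_backend(
    compute_capability: Tuple[int, int],
    is_nvidia: bool = True,
    flash_attn_installed: bool = True,
) -> str:
    """Return a backend name (hypothetical names, for illustration only)."""
    if (is_nvidia and flash_attn_installed
            and compute_capability >= AMPERE_CAPABILITY):
        # Ampere or newer: FlashAttention can be called directly.
        return "flash_attn"
    # Everything else falls back to the xformers-based path.
    return "xformers"
```

On a real machine the `(major, minor)` tuple could come from `torch.cuda.get_device_capability()`.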

WoosukKwon · 2024-02-23T07:08:36Z

@zhuohan123 What do you think about this design? Please note that I used flash_attn for now, but it will be replaced with FlashInfer.

vllm/model_executor/layers/attention/flash.py

zhuohan123

In general the refactor LGTM. My only small concern is the learning cost of AttentionFactory, since it does not behave exactly like a torch nn.Module. I think this can make it harder for people to add new models.

vllm/model_executor/layers/attention/__init__.py

vllm/model_executor/layers/attention/non_flash.py

vllm/model_executor/layers/attention/paged_attn.py

vllm/model_executor/layers/attention/__init__.py

chenxu2048 · 2024-02-26T03:17:10Z

vllm/model_executor/layers/attention/flash.py

+                    alibi_slopes=self.alibi_slopes,
+                )
+            else:
+                # prefix-enabled attention


The prefix-enabled attention and decoding parts are the same as those in non_flash.py. Could we move them into BaseAttention? For example:

    class BaseAttention(nn.Module):
        def forward(self, ...):
            if input_metadata.is_prompt:
                if ...:
                    self._do_prompt_attention()
                else:
                    # prefix-enabled attention
            else:
                # Decoding run.

        def _do_prompt_attention(self):
            # use xformers or flash_attn here

Hi @chenxu2048, thanks for your input. I intentionally avoided this design since some attention implementations may not follow this structure. For example, an attention kernel may process the prompt attention and prefix-enabled attention together. In terms of flexibility, I think the current structure is preferable.

Thanks for your explanation.

WoosukKwon · 2024-03-01T07:57:39Z

@zhuohan123 PTAL. Please note that I intentionally didn't make changes to other models than Llama.

zhuohan123

Thanks! In general LGTM. Will you change all other model files when you merge the PR?

Yard1 · 2024-03-06T21:14:06Z

vllm/model_executor/layers/attention/backends/flash_attn.py

+        if input_metadata.is_prompt:
+            # Prompt run.
+            if (key_cache is None or value_cache is None
+                    or input_metadata.block_tables.numel() == 0):
+                # normal attention
+                query = query.unflatten(0, (batch_size, seq_len))
+                key = key.unflatten(0, (batch_size, seq_len))
+                value = value.unflatten(0, (batch_size, seq_len))
+                output = flash_attn_func(
+                    query,
+                    key,
+                    value,
+                    softmax_scale=self.scale,
+                    causal=True,
+                    window_size=self.sliding_window,
+                    alibi_slopes=self.alibi_slopes,
+                )
+            else:
+                # prefix-enabled attention
+                output = PagedAttentionImpl.forward_prefix(
+                    query,
+                    key,
+                    value,
+                    key_cache,
+                    value_cache,
+                    input_metadata,
+                    self.num_heads,
+                    self.num_kv_heads,
+                    self.alibi_slopes,
+                )
+        else:
+            # Decoding run.
+            output = PagedAttentionImpl.forward_decode(
+                query,
+                key_cache,
+                value_cache,
+                input_metadata,
+                self.num_kv_heads,
+                self.scale,
+                self.alibi_slopes,
+            )
+
+        # Reshape the output tensor.
+        return output.view(batch_size, seq_len, hidden_size)


I would still suggest separating this out into private methods (_forward_decode, _forward_prefill etc.) so that forward can just decide which method to dispatch.

Thanks for your input! Actually, I intentionally avoided the design you proposed to keep the attention backends flexible to implement. As you pointed out, an attention backend performs 4 tasks: 1) storing the input KV tensors into the KV cache, 2) computing prefills, 3) computing prefills with prefixes, and 4) computing decodes. Currently, the two attention backends (FlashAttentionBackend and XFormersBackend) have a kernel for each task. However, this may not necessarily be true in the future. For example, depending on the kernel implementation, one can compute prefills with and without prefixes (2&3) at the same time. As another example, an attention kernel in TRT-LLM stores the KV cache while computing decodes (1&4). These can get even more complicated if we implement something like Cascade inference. Hence, I believe we shouldn't fix a certain structure for the attention backends.
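The argument above can be illustrated with a toy sketch in which only `forward` is fixed by the interface. All class names, method signatures, and kernel names here are hypothetical; the "kernels" are plain strings so that the different task groupings are visible, and nothing below is the PR's actual code.

```python
from abc import ABC, abstractmethod
from typing import List


class AttentionBackend(ABC):
    """Hypothetical interface: only `forward` is part of the contract."""

    @abstractmethod
    def forward(self, is_prompt: bool, has_prefix: bool) -> List[str]:
        """Return the names of the kernels that would be launched, in order."""


class PerTaskBackend(AttentionBackend):
    """One kernel per task (tasks 1-4 from the comment above)."""

    def forward(self, is_prompt: bool, has_prefix: bool) -> List[str]:
        launched = ["store_kv"]  # task 1
        if is_prompt:
            # task 3 if a prefix is cached, otherwise task 2
            launched.append("prefill_with_prefix" if has_prefix else "prefill")
        else:
            launched.append("decode")  # task 4
        return launched


class FusedBackend(AttentionBackend):
    """Imagined backend that fuses tasks 2&3 and tasks 1&4."""

    def forward(self, is_prompt: bool, has_prefix: bool) -> List[str]:
        if is_prompt:
            # A single kernel handles prefills with and without prefixes.
            return ["store_kv", "prefill_any"]
        # A single kernel stores KV while computing the decode.
        return ["store_kv_and_decode"]
```

A base class that required separate prefill, prefix, and decode methods would fit `PerTaskBackend` but not `FusedBackend`, which is the flexibility concern raised in the comment.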

@Yard1 What do you think about this?

I agree we should not make them part of public API, but they can be done as private APIs for the backends that do have that distinction. Basically we should try to modularize the forward method if possible as it makes it easier to read and test.
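A minimal sketch of that suggestion, assuming a backend that does have the prefill / prefix / decode distinction. `Metadata`, `ModularBackend`, and the string return values are hypothetical stand-ins; the dispatch condition mirrors the one in the diff above (no cache, or an empty block table, means plain prefill).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for input_metadata."""
    is_prompt: bool
    num_block_table_entries: int = 0


class ModularBackend:
    """Toy backend whose forward() only decides where to dispatch."""

    def forward(self, query: str, kv_cache: Optional[dict],
                meta: Metadata) -> str:
        if meta.is_prompt:
            if kv_cache is None or meta.num_block_table_entries == 0:
                return self._forward_prefill(query)
            return self._forward_prefix(query, kv_cache)
        return self._forward_decode(query, kv_cache)

    # Private helpers: not part of the backend's public API.
    def _forward_prefill(self, query: str) -> str:
        return f"prefill({query})"

    def _forward_prefix(self, query: str, kv_cache: dict) -> str:
        return f"prefix({query})"

    def _forward_decode(self, query: str, kv_cache: Optional[dict]) -> str:
        return f"decode({query})"
```

Each helper can then be exercised on its own in a unit test, while `forward` stays the only method callers rely on.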

Got it. First, I believe the current implementation is easy to read; XFormersBackend is essentially the same as the current main branch and FlashAttentionBackend is simpler than that. Particularly for FlashAttentionBackend, I believe the implementation in this PR is very easy to understand.

That being said, I do agree that modularizing the backends will make it easy to test them. However, since this PR has already been delayed quite a bit, let's merge the PR and do modularization in the next PR.

Yard1 · 2024-03-06T21:16:19Z

vllm/model_executor/layers/attention/ops/paged_attn.py

+        if use_v1:
+            # Run PagedAttention V1.
+            ops.paged_attention_v1(
+                output,
+                query,
+                key_cache,
+                value_cache,
+                num_kv_heads,
+                scale,
+                input_metadata.block_tables,
+                input_metadata.context_lens,
+                block_size,
+                input_metadata.max_context_len,
+                alibi_slopes,
+                input_metadata.kv_cache_dtype,
+            )
+        else:
+            # Run PagedAttention V2.
+            assert _PARTITION_SIZE % block_size == 0
+            tmp_output = torch.empty(
+                size=(num_seqs, num_heads, max_num_partitions, head_size),
+                dtype=output.dtype,
+                device=output.device,
+            )
+            exp_sums = torch.empty(
+                size=(num_seqs, num_heads, max_num_partitions),
+                dtype=torch.float32,
+                device=output.device,
+            )
+            max_logits = torch.empty_like(exp_sums)
+            ops.paged_attention_v2(
+                output,
+                exp_sums,
+                max_logits,
+                tmp_output,
+                query,
+                key_cache,
+                value_cache,
+                num_kv_heads,
+                scale,
+                input_metadata.block_tables,
+                input_metadata.context_lens,
+                block_size,
+                input_metadata.max_context_len,
+                alibi_slopes,
+                input_metadata.kv_cache_dtype,
+            )
+        return output


ditto as in previous comment (_forward_decode_v1, _forward_decode_v2)

ditto. Let's do it in the next PR.

zhaoyang-star · 2024-03-11T03:05:05Z

vllm/model_executor/layers/attention/backends/flash_attn.py

+                )
+        else:
+            # Decoding run.
+            output = PagedAttentionImpl.forward_decode(


I am just curious: why not use flash_attn_with_kvcache? The kernel is faster than paged_attention_kernel. More benchmark details can be found in #2744.

WoosukKwon added 9 commits February 23, 2024 02:57

Add attention_backends

a40b2c9

Move

6b6f7c7

Remove if

f2b888c

Remove if

7f4422c

Attention

1d9dc99

Minor

534d0f8

Minor

404022a

Rename

194df2f

Add flash-attn

a6910ea

WoosukKwon requested a review from zhuohan123 February 23, 2024 07:07

sighingnow reviewed Feb 23, 2024

vllm/model_executor/layers/attention/flash.py (outdated, resolved)

zhuohan123 reviewed Feb 25, 2024

chenxu2048 reviewed Feb 26, 2024

Address review

346b1b7

sighingnow mentioned this pull request Feb 27, 2024

Introduce flash-attn (>= 2.5.0). #3010

Closed

WoosukKwon added 8 commits February 28, 2024 22:38

Merge branch 'main' into refactor-attn

05579fa

Move

da115dd

Move

19ecd4d

Minor

5b8e8c7

Rename

ef8ace1

Fix attention

6490fb4

Minor

3baebac

Minor

6a81692

WoosukKwon requested a review from zhuohan123 February 29, 2024 00:42

WoosukKwon marked this pull request as ready for review February 29, 2024 00:47

WoosukKwon added 2 commits February 29, 2024 00:49

Minor

963a2c7

Add comment

38baed7

zhuohan123 approved these changes Mar 1, 2024

WoosukKwon added 2 commits March 6, 2024 21:04

Add FlashInfer wheels to vLLM

ed1ab56

Minor

73aedbd

Yard1 reviewed Mar 6, 2024

WoosukKwon added 8 commits March 6, 2024 21:58

maybe fix packaging

0214afd

Add gitignore

12ea60d

Revert

b460c21

revert

4ffa89f

Copy after build

6ba0e70

Binary distribution

974db99

yapf

f72560c

Fix a bug for FP32

0b8ac9e

WoosukKwon merged commit 2daf23a into main Mar 7, 2024
20 of 23 checks passed

WoosukKwon deleted the refactor-attn branch March 7, 2024 10:03

This was referenced Mar 7, 2024

[Minor fix] Include flash_attn in docker image #3254

Closed

flash_attn missing from Docker image #3255

Closed

AlpinDale mentioned this pull request Mar 7, 2024

feat: refactor attention backend into FA2 and xFormers PygmalionAI/aphrodite-engine#291

Closed

1 task

mgoin mentioned this pull request Mar 7, 2024

Issues with installing from source due to flash-attn subprocess install #3265

Closed

AlpinDale mentioned this pull request Mar 8, 2024

fix: error due to FA2 when building #3266

Closed

WoosukKwon added a commit that referenced this pull request Mar 8, 2024

Revert "Separate attention backends (#3005)"

5c6c40f

This reverts commit 2daf23a.

chenxu2048 mentioned this pull request Mar 8, 2024

[FIX] Make flash_attn optional #3269

Merged

Qubitium mentioned this pull request Mar 8, 2024

Regression in llama model inference due to #3005 #3282

Closed

grandiose-pizza pushed a commit to grandiose-pizza/vllm-jais that referenced this pull request Mar 9, 2024

adapted to PR vllm-project#3005

8fd0aec

grandiose-pizza mentioned this pull request Mar 9, 2024

Added support for Jais models #3183

Merged

zhaoyang-star reviewed Mar 11, 2024

tdoublep mentioned this pull request Mar 13, 2024

FlashAttentionBackend only supports head sizes supported by xformers #3359

Closed

hongxiayang mentioned this pull request Mar 21, 2024

[ROCm] [Hardware][AMD] Remove xformer patches and ray issue fix #3558

Closed

dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024

Separate attention backends (vllm-project#3005)

63e03d2

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024

Separate attention backends (vllm-project#3005)

7184f4b
