
Add an inplace concat custom op based on CUDA VMM API (resubmitted) #9320

Open · wants to merge 24 commits into develop
Conversation

lszxb
Contributor

@lszxb lszxb commented Oct 28, 2024

PR types

Performance optimization

PR changes

Others

Description

This PR attempts to add inplace concat support based on the CUDA VMM API to the current large-model inference pipeline (the idea is similar to vAttention), so that the entire KV cache no longer has to be copied at every decoding step.
For now only the custom operator is implemented; a pass that automatically adapts other models still needs to be added later.
This PR currently applies the scheme to the llama model, where it gives roughly a 10% speedup with 3072 input + 1024 output tokens.

The main idea at the moment is:

  • A special kind of Tensor is used whose device memory is allocated with the VMM API. It uses a special phi::Allocation that reserves a large range of virtual address space at creation time; physical pages can then be allocated and mapped into that space whenever needed (a rough driver-API sketch of this mechanism is given after this list).
  • To stay compatible with the remaining call sites, the cache keeps the shape batch x seq_len x num_head x head_dim, but since new states are appended at the tail of the cache, its memory layout is actually seq_len x batch x num_head x head_dim.
    The semantics of the vtensor_reserve_one_token custom op are roughly:
  • If key_cache is not a VTensor, a new VTensor is allocated and the data of the original key_cache is copied into it. Then the VTensor growth mechanism reserves space for one more token at the tail, and key_states is copied into that new space.
  • If key_cache is already a VTensor, the VTensor growth mechanism is used directly to reserve space for one more token at the tail, and key_states is copied into that new space.
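For reference, here is a minimal sketch of the growable-buffer idea built directly on the CUDA driver VMM API (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess). The GrowableBuffer class, its members, and the lack of error handling are illustrative simplifications, not the actual VTensor / phi::Allocation code in this PR:

#include <cuda.h>

#include <vector>

// Simplified sketch of a buffer that reserves a large virtual range once and
// maps physical blocks into it on demand, so it can grow without moving data.
// Error checking is omitted for brevity.
class GrowableBuffer {
 public:
  // Per the PR description, the real implementation currently reserves 1 GiB
  // of virtual address space and maps it in 32 MiB blocks.
  GrowableBuffer(size_t reserved_bytes, size_t block_bytes, int device)
      : reserved_(reserved_bytes), block_(block_bytes), mapped_(0) {
    prop_ = {};
    prop_.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop_.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop_.location.id = device;
    // Reserve virtual address space only; no physical memory is used yet.
    cuMemAddressReserve(&base_, reserved_, /*alignment=*/0, /*addr=*/0, 0);
  }

  // Make sure at least `bytes` of the reservation are backed by physical
  // memory. `block_` must be a multiple of the allocation granularity.
  void EnsureMapped(size_t bytes) {
    while (mapped_ < bytes && mapped_ + block_ <= reserved_) {
      CUmemGenericAllocationHandle handle;
      cuMemCreate(&handle, block_, &prop_, 0);            // new physical block
      cuMemMap(base_ + mapped_, block_, 0, handle, 0);    // map it at the tail
      CUmemAccessDesc access = {};
      access.location = prop_.location;
      access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
      cuMemSetAccess(base_ + mapped_, block_, &access, 1);
      handles_.push_back(handle);
      mapped_ += block_;
    }
  }

  CUdeviceptr data() const { return base_; }

 private:
  CUdeviceptr base_ = 0;
  CUmemAllocationProp prop_;
  size_t reserved_, block_, mapped_;
  std::vector<CUmemGenericAllocationHandle> handles_;
};

With the seq_len x batch x num_head x head_dim layout described above, appending one token then amounts to EnsureMapped((seq_len + 1) * token_bytes) followed by a device copy of key_states to the tail; data already in the cache is never moved or copied.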

Currently known issues:

  • Only appending space for one token at a time is supported.
  • The size of the reserved virtual address space and the block size are currently fixed (1 GiB and 32 MiB); it might be better to expose an API so that users can tune them. Any block size would still have to respect the VMM allocation granularity; see the sketch after this list.
  • The input and output key_cache share the same memory, which may cause conflicts in some cases.
  • The approach relies on every step using the same Tensor as the kv cache; if any other operation replaces the kv cache Tensor (for example cloning it into another Tensor), the optimization breaks, so it must be used together with the optimization in this PR (the assign_out_ op would cause a copy).
  • Device memory allocated through this op cannot be managed uniformly by the existing Allocator.
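As a hedged note on the block-size point above: even if the sizes are made configurable, the requested block size has to be rounded up to the VMM allocation granularity. A minimal sketch (the function name and defaults are illustrative, not code from this PR):

#include <cuda.h>

// Round a requested block size (e.g. the current 32 MiB default) up to the
// minimum allocation granularity accepted by cuMemCreate / cuMemMap.
size_t AlignedBlockSize(int device, size_t requested_bytes) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;
  size_t granularity = 0;
  cuMemGetAllocationGranularity(&granularity, &prop,
                                CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  return (requested_bytes + granularity - 1) / granularity * granularity;
}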

This PR also includes the contents of the following two PRs:


paddle-bot bot commented Oct 28, 2024

Thanks for your contribution!

@@ -0,0 +1,71 @@
#include "paddle/extension.h"
Collaborator

Please add a copyright notice at the top of the newly added files.

Contributor Author

Added.

const paddle::Tensor& append_state,
bool transposed_input
) {
// std::cout << "vtensor_reserve_one_token 1 " << (uintptr_t)cache_transposed.data() << std::endl;
Collaborator

Commented-out debug lines like this can all be removed.

Comment on lines 21 to 23
"./gpu/pass/remove_assign_out_pass.cc",
"./gpu/pass/apply_vtensor_concat_pass.cc",
"./gpu/vtensor.cu", # TODO: this haven't tested with hip
Collaborator

This file doesn't need to be changed; for now it's fine to only use this under gpu.

Contributor Author

OK, these lines have been removed.

Comment on lines 336 to 341
if is_paddlenlp_ops_available():
    import paddlenlp_ops
    inference_config.enable_custom_passes([
        "remove_assign_out_pass",  # remove the assign_out_ op at the end of while loop
        "apply_vtensor_concat_pass",  # replace concat op with vtensor implementation
    ])
Collaborator

Suggested change
if is_paddlenlp_ops_available():
    import paddlenlp_ops
    inference_config.enable_custom_passes([
        "remove_assign_out_pass",  # remove the assign_out_ op at the end of while loop
        "apply_vtensor_concat_pass",  # replace concat op with vtensor implementation
    ])
try:
    from paddlenlp_ops import remove_assign_out_pass, apply_vtensor_concat_pass
    inference_config.enable_custom_passes([
        "remove_assign_out_pass",  # remove the assign_out_ op at the end of while loop
        "apply_vtensor_concat_pass",  # replace concat op with vtensor implementation
    ])
except ImportError:
    pass

Please change it like this.

Contributor Author

paddlenlp_ops doesn't expose the passes as importable objects, so I changed the check to the newly added vtensor_reserve_one_token op instead.


codecov bot commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 28.57143% with 5 lines in your changes missing coverage. Please review.

Project coverage is 52.24%. Comparing base (81f5ab5) to head (d5a9393).
Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/generation/utils.py 0.00% 4 Missing ⚠️
paddlenlp/generation/logits_process.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9320      +/-   ##
===========================================
- Coverage    52.92%   52.24%   -0.69%     
===========================================
  Files          661      671      +10     
  Lines       107069   109655    +2586     
===========================================
+ Hits         56670    57288     +618     
- Misses       50399    52367    +1968     

