[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode #4628
QQ: can we make it not a kwarg?
Here I'm following other backends such as vllm/vllm/attention/backends/flash_attn.py (line 120 at commit 845a3f2).
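For reference, the pattern being followed looks roughly like the sketch below; apart from query/key/value/kv_cache, the names and the defaulted kv_scale parameter are illustrative assumptions, not the exact vLLM signature.

```python
# Sketch of the kwarg-with-default pattern referred to above.
# FlashInferImpl, attn_metadata, and kv_scale are assumed names, not the
# actual vLLM code from this PR.
from typing import Any, Optional

import torch


class FlashInferImpl:
    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: Optional[torch.Tensor],
        attn_metadata: Any,
        kv_scale: float = 1.0,  # passed as a keyword with a default, mirroring flash_attn.py
    ) -> torch.Tensor:
        # Real implementation dispatches to the FlashInfer kernels; elided here.
        return query
```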
I think we discussed this before, but what's the overhead of this call?
For Llama-7B on an A100, the shape of query is [256, 12, 64], and this line takes ~0.037 ms.
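A per-call cost of that order can be reproduced with a micro-benchmark along the lines below; the op it times (a transposed copy of a query tensor of that shape) is only a stand-in, not the actual line from this PR.

```python
# Rough micro-benchmark for a per-call overhead in the ~0.01-0.1 ms range.
# The timed op is a stand-in for the line under discussion.
import time

import torch

query = torch.randn(256, 12, 64, device="cuda", dtype=torch.float16)

# Warm-up so allocator and launch setup costs are excluded.
for _ in range(10):
    _ = query.transpose(0, 1).contiguous()
torch.cuda.synchronize()

iters = 1000
start = time.perf_counter()
for _ in range(iters):
    _ = query.transpose(0, 1).contiguous()
torch.cuda.synchronize()
elapsed_ms = (time.perf_counter() - start) * 1000 / iters
print(f"~{elapsed_ms:.4f} ms per call")
```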
Why is that? Can you add a comment? (Also, is it fundamental?)
This only happens during the profiling phase, where the cache is initialized (not paged). We use flash attention for the profile run.
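To make the fallback concrete, the dispatch roughly follows the shape sketched below; the function name, the exact condition, and the use of scaled_dot_product_attention as a stand-in for the flash-attention call are assumptions, not the PR's actual code.

```python
# Sketch of the profile-run fallback described above: when the KV cache is
# not yet allocated as paged blocks, attend densely over the in-batch
# keys/values; otherwise use FlashInfer's paged prefill/decode kernels.
from typing import Any, Optional

import torch
import torch.nn.functional as F


def forward(
    query: torch.Tensor,  # [num_tokens, num_heads, head_size]
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: Optional[torch.Tensor],
    attn_metadata: Any,   # backend-specific metadata (paged indices, wrappers, ...)
) -> torch.Tensor:
    if kv_cache is None or kv_cache.numel() == 0:
        # Profile run: cache is initialized but not paged, so fall back to a
        # dense attention call over the current batch (causal masking and
        # multi-sequence handling omitted for brevity).
        return F.scaled_dot_product_attention(
            query.transpose(0, 1), key.transpose(0, 1), value.transpose(0, 1)
        ).transpose(0, 1)
    # Serving path: dispatch to FlashInfer's paged prefill/decode kernels
    # (what the rest of the PR implements; elided in this sketch).
    raise NotImplementedError("FlashInfer paged path elided in this sketch")
```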