Add Encoder-decoder model support and T5 Model support #3117
base: main
Conversation
Thanks @js8544, will review.
Thanks @js8544! Taking a look.
The NaN issue may arise due to numerical overflow when using FP16 for computation in these models (T5-Large).
T5 enc/dec example file; linting/formatting
Small PR for debug print statements
@afeldman-nm left a few specific comments. My overarching reaction is as follows: the implementation does not modify any of the … So my question is:
I don't mind having two separate block tables. I chose to have them in one because it would make minimal change to existing components. In fact, it only adds a couple of ifs in … Also cc @zhuohan123 for ideas and comments.
Note: I was wrong about this breaking decoders; that was a silly comment. @js8544 It's a good point. Perhaps we can make the indexing into the block tables more transparent about what is going on, so the code is easier to follow. We will think more about it.
I agree. The current indexing scheme (computing paddings each time) is painful and hard to read.
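Purely to make the pain point concrete, here is a hedged sketch of what indexing into a single combined block table with padding offsets might look like; the layout and all names below are hypothetical and not taken from the PR:

```python
from typing import Dict, List

# Hypothetical combined layout: for each sequence, the cross-attention KV
# blocks come first (padded to a fixed maximum), followed by the
# self-attention KV blocks. Every lookup must recompute where each region
# starts, which is the offset arithmetic the comments above call painful.
def self_attn_blocks(
    block_table: Dict[int, List[int]],
    seq_id: int,
    max_cross_blocks: int,
) -> List[int]:
    # Skip past the padded cross-attention region to reach the
    # self-attention blocks for this sequence.
    return block_table[seq_id][max_cross_blocks:]
```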
fix _make_tensor_with_pad args change which broke decoder scenarios
Yeah, I noticed that transformers' original implementation of T5 also suffers from this. Using BF16 or FP32 should work.
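For reference, a minimal sketch of the workaround using Hugging Face transformers (the checkpoint name and generation settings are just illustrative):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Loading in bfloat16 (or float32) avoids the FP16 overflow that produces
# NaNs in the larger T5 checkpoints; "t5-large" is an illustrative choice.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-large", torch_dtype=torch.bfloat16
).eval()

inputs = tokenizer("translate English to German: Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```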
@zhuohan123 @WoosukKwon Would you guys mind taking a look at this PR? T5 seems to be working now.
FYI, I think this PR has some conflicts with recent changes to the main branch. I am looking at resolving them. This PR was previously passing all of the tests, so I am hoping that once the recent conflicts are resolved we will be ready to merge. @zhuohan123 @WoosukKwon
Hi, can this feature be extended to T5-3B/11B and Flan-T5-XL/XXL models as well? We observe some errors with both of these cases.
Hello, any update on fixing the conflicts and merging?
Hello @Abineshik, yes, things are moving apace. Thanks for checking in. I determined it is probably best for encoder-decoder models to have separate block tables for self- and cross-attention KVs, as opposed to packing all KVs into a single block table, which had been the existing approach. Having two block tables makes it easier to work with the vLLM Attention wrapper when implementing encoder self-attention and decoder cross-attention; the indexing arithmetic becomes cleaner. I am almost finished making this change, and then I will update the PR.
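To illustrate the two-block-table idea, here is a hedged sketch; the class and field names below are invented for illustration and are not the actual vLLM data structures:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch: keep self-attention and cross-attention KV block
# tables separate instead of packing both into one table and recomputing
# padding offsets at every lookup.
@dataclass
class EncDecBlockTables:
    # Sequence id -> physical KV-cache blocks used by decoder self-attention
    # (grows as the decoder generates tokens).
    self_attn: Dict[int, List[int]] = field(default_factory=dict)
    # Sequence id -> blocks holding the encoder-output KVs used by
    # cross-attention (fixed once the prompt has been encoded).
    cross_attn: Dict[int, List[int]] = field(default_factory=dict)

    def blocks_for(self, seq_id: int, is_cross: bool) -> List[int]:
        # No padding arithmetic: each attention type indexes its own table.
        table = self.cross_attn if is_cross else self.self_attn
        return table[seq_id]
```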
@afeldman-nm how is the change you are working on going?
Work is still ongoing, but I hope to finish soon!
@afeldman-nm do you have an ETA for this? I can probably help if needed. Thanks.
Update: supporting encoder-decoder in a single PR was very difficult, as it touches all subsystems in vLLM, which caused a lot of issues in our development cycle and many, many issues rebasing. We have split this work out into a series of PRs which are being merged.
We could use help expanding model support soon!
@robertgshaw2-neuralmagic |
@afeldman-nm Thanks a lot for initiating this PR. How much time will it take to merge this PR to support T5-based models like MADLAD-400? Thanks
FYI, encoder/decoder support has landed. RFC #7366 overviews the next steps, in terms of which models to add and which features have yet to be supported with encoder/decoder models. @yugaljain1999, to answer your question, T5's relative position encoding requires a custom attention bias, which is currently unsupported. There is a section on custom attention bias in the RFC. This is the only "hard blocker" to T5 support. Still waiting to see who will pick up the custom bias workstream.
Add support for encoder-decoder models, with T5 as an example. There are two main differences between enc-dec and decoder-only models.
T5 has a custom bias in its attention, so I also added a custom bias argument to the CUDA kernel.
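For context, a minimal sketch of what attention with an additive bias term looks like in plain PyTorch (this is not the PR's CUDA kernel; function and argument names are illustrative):

```python
import math
import torch

def attention_with_bias(q, k, v, bias):
    # q, k, v: [batch, heads, seq_len, head_dim]
    # bias:    [1 or batch, heads, q_len, k_len] additive term, e.g. T5's
    #          relative-position bias, added to the scores before softmax.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    # (T5 itself omits the 1/sqrt(d) scaling, folding it into initialization.)
    scores = scores + bias
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)
```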
I can run the t5-small model successfully, but the outputs of t5-large and larger models become NaN at some point. I am still digging into this issue. Also, t5-small is only 20% faster than transformers on my machine, so there is likely a lot of room for performance improvement.
FIX #8036