Add Encoder-decoder model support and T5 Model support #3117
base: main
Conversation
Thanks @js8544, will review.
Thanks @js8544! Taking a look.
The NaN issue may arise due to numerical overflow when using FP16 for computation in these models (T5-Large).
T5 enc/dec example file; linting/formatting
Small PR for debug print statements
@afeldman-nm left a few specific comments. My overarching reaction is as follows: the implementation does not modify any of the … So my question is:
I don't mind having two separate block tables. I chose to have them in one because it would make minimal change to existing components. In fact, it only adds a couple of ifs in … Also cc @zhuohan123 for ideas and comments.
Note: I was wrong about this breaking decoders; that was a silly comment. @js8544 It's a good point. Perhaps we can make the indexing into the block tables more transparent about what is going on, so the code is easier to follow. We will think more about it.
I agree. The current indexing scheme (computing paddings each time) is painful and hard to read.
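Purely to make the pain point concrete, here is a hedged sketch of what indexing into a single combined block table with padding offsets might look like; the layout and all names below are hypothetical and not taken from the PR:

```python
from typing import Dict, List

# Hypothetical combined layout: for each sequence, the cross-attention KV
# blocks come first (padded to a fixed maximum), followed by the
# self-attention KV blocks. Every lookup must recompute where each region
# starts, which is the offset arithmetic the comments above call painful.
def self_attn_blocks(
    block_table: Dict[int, List[int]],
    seq_id: int,
    max_cross_blocks: int,
) -> List[int]:
    # Skip past the padded cross-attention region to reach the
    # self-attention blocks for this sequence.
    return block_table[seq_id][max_cross_blocks:]
```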
fix _make_tensor_with_pad args change which broke decoder scenarios
Yeah, I noticed that transformers' original implementation of T5 also suffers from this. Using BF16 or FP32 should work.
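For reference, a minimal sketch of the workaround using Hugging Face transformers (the checkpoint name and generation settings are just illustrative):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Loading in bfloat16 (or float32) avoids the FP16 overflow that produces
# NaNs in the larger T5 checkpoints; "t5-large" is an illustrative choice.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-large", torch_dtype=torch.bfloat16
).eval()

inputs = tokenizer("translate English to German: Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```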
@zhuohan123 @WoosukKwon Would you guys mind taking a look at this PR? T5 seems to be working now.
FYI, I think this PR has some conflicts with recent changes to the main branch. I am looking at resolving them. This PR was previously passing all of the tests, so I am hoping that once the recent conflicts are resolved we will be ready to merge. @zhuohan123 @WoosukKwon
Hi, can this feature be extended to T5-3B/11B and Flan-T5-XL/XXL models as well? We observe some errors with both of these cases.
Hello, any update on fixing the conflicts and merging?
Hello @Abineshik, yes, things are moving apace. Thanks for checking in. I determined it is probably best for encoder-decoder models to have separate block tables for self- and cross-attention KVs, as opposed to packing all KVs into a single block table, which had been the existing approach. Having two block tables makes it easier to work with the vLLM Attention wrapper when implementing encoder self-attention and decoder cross-attention; the indexing arithmetic becomes cleaner. I am almost finished making this change, and then I will update the PR.
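To illustrate the two-block-table idea, here is a hedged sketch; the class and field names below are invented for illustration and are not the actual vLLM data structures:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch: keep self-attention and cross-attention KV block
# tables separate instead of packing both into one table and recomputing
# padding offsets at every lookup.
@dataclass
class EncDecBlockTables:
    # Sequence id -> physical KV-cache blocks used by decoder self-attention
    # (grows as the decoder generates tokens).
    self_attn: Dict[int, List[int]] = field(default_factory=dict)
    # Sequence id -> blocks holding the encoder-output KVs used by
    # cross-attention (fixed once the prompt has been encoded).
    cross_attn: Dict[int, List[int]] = field(default_factory=dict)

    def blocks_for(self, seq_id: int, is_cross: bool) -> List[int]:
        # No padding arithmetic: each attention type indexes its own table.
        table = self.cross_attn if is_cross else self.self_attn
        return table[seq_id]
```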
@afeldman-nm how is the change you are working on going?
Work is still ongoing, but I hope to finish soon!
@afeldman-nm do you have an ETA for this? I can probably help if needed. Thanks.
Update: supporting encoder-decoder in a single PR was very difficult, as it touches all subsystems in vLLM, which caused a lot of issues in our development cycle and many, many issues rebasing. We have split this work out into a series of PRs which are being merged.
We could use help expanding model support soon!
@robertgshaw2-neuralmagic |
@afeldman-nm Thanks a lot for initiating this PR. How much time will it take to merge this PR to support T5-based models like MADLAD-400? Thanks
FYI, encoder/decoder support has landed. RFC #7366 overviews the next steps, in terms of which models to add and which features have yet to be supported with encoder/decoder models. @yugaljain1999, to answer your question, T5's relative position encoding requires a custom attention bias, which is currently unsupported. There is a section on custom attention bias in the RFC. This is the only "hard blocker" to T5 support. Still waiting to see who will pick up the custom bias workstream.
Add support for encoder-decoder models, with T5 as an example. There are two main differences between enc-dec and decoder-only models.
T5 has a custom bias in its attention, so I also added a custom bias argument to the CUDA kernel.
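For context, a minimal sketch of what attention with an additive bias term looks like in plain PyTorch (this is not the PR's CUDA kernel; function and argument names are illustrative):

```python
import math
import torch

def attention_with_bias(q, k, v, bias):
    # q, k, v: [batch, heads, seq_len, head_dim]
    # bias:    [1 or batch, heads, q_len, k_len] additive term, e.g. T5's
    #          relative-position bias, added to the scores before softmax.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    # (T5 itself omits the 1/sqrt(d) scaling, folding it into initialization.)
    scores = scores + bias
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)
```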
I can run the t5-small model successfully, but the outputs of t5-large and larger models become NaN at some point. I am still digging into this issue. Also, t5-small is only 20% faster than transformers on my machine, so there is likely a lot of room for performance improvement.
FIX #8036