Optimize HF text generation #4814

oobabooga · 2023-12-05T02:45:23Z

First change: Decode only the newly generated tokens during streaming instead of re-decoding the entire sequence on every update. This makes a difference when the context is large (like 40,000 tokens).
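
To illustrate the idea (this is a minimal sketch, not the code from this PR): the names `stream_incremental` and `toy_decode` are hypothetical, and the "tokenizer" is a toy byte-level one where each token id is a single UTF-8 byte. Only the tokens that have not been decoded yet are passed to the decoder, and a token that ends in the middle of a multi-byte character is held back until the character is complete.

```python
from typing import Callable, Iterable, Iterator, List


def toy_decode(token_ids: List[int]) -> str:
    # Hypothetical byte-level "tokenizer": each token id is one UTF-8 byte.
    return bytes(token_ids).decode("utf-8", errors="replace")


def stream_incremental(
    token_ids: Iterable[int],
    decode: Callable[[List[int]], str],
) -> Iterator[str]:
    """Yield the cumulative reply, decoding only tokens not decoded yet."""
    reply = ""
    pending: List[int] = []
    for token_id in token_ids:
        pending.append(token_id)
        piece = decode(pending)  # cost depends on len(pending), not on the whole reply
        if piece.endswith("\ufffd"):
            # Incomplete multi-byte character: wait for the next token.
            continue
        reply += piece
        pending.clear()
        yield reply


if __name__ == "__main__":
    for partial in stream_incremental(list("héllo".encode("utf-8")), toy_decode):
        print(partial)
```

With a real subword tokenizer, decoding a slice on its own may drop a leading space that decoding the full sequence would keep, so a production version would likely need to handle that case explicitly.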

Here is a test with ExLlama_HF and a 13b model. The prompt is cached to make the difference more pronounced.

Before:

Output generated in 10.28 seconds (48.56 tokens/s, 499 tokens, context 3550, seed 1197861167)
Output generated in 10.27 seconds (48.59 tokens/s, 499 tokens, context 3550, seed 1463481238)
Output generated in 10.30 seconds (48.43 tokens/s, 499 tokens, context 3550, seed 1587775594)

After:

Output generated in 9.69 seconds (51.50 tokens/s, 499 tokens, context 3550, seed 909082919)
Output generated in 9.71 seconds (51.41 tokens/s, 499 tokens, context 3550, seed 221584078)
Output generated in 9.72 seconds (51.36 tokens/s, 499 tokens, context 3550, seed 105310110)
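
As a quick check on the numbers above, the following snippet averages the three tokens/s figures from each run and computes the relative gain, which comes out to roughly 6%. The values are copied from the logs; the variable names are just for illustration.

```python
# tokens/s figures copied from the "Before" and "After" logs above
before = [48.56, 48.59, 48.43]
after = [51.50, 51.41, 51.36]


def mean(values):
    return sum(values) / len(values)


speedup = mean(after) / mean(before) - 1.0
print(f"{mean(before):.2f} -> {mean(after):.2f} tokens/s ({speedup:.1%} faster)")
```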

Second change: Restrict the UI streaming updates to 5 per second when --listen or --share is provided. I find that this makes the UI a lot more responsive over a network.
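
A rate limit like this could be sketched as a generator wrapper. This is an assumption about the general approach, not the code in this PR: `throttle_updates` and its parameters are hypothetical names. Updates that arrive too soon after the last emitted one are skipped, and the final state is always emitted so the UI ends up showing the complete reply.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttle_updates(
    updates: Iterable[T],
    max_per_second: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Forward at most `max_per_second` updates, plus the final one."""
    min_interval = 1.0 / max_per_second
    last_emit = None
    latest = None
    skipped_latest = False
    for latest in updates:
        now = clock()
        if last_emit is None or now - last_emit >= min_interval:
            last_emit = now
            skipped_latest = False
            yield latest
        else:
            # Too soon after the previous UI update: remember it and move on.
            skipped_latest = True
    if skipped_latest:
        # Make sure the last state is never lost.
        yield latest
```

In this sketch the wrapper would only be applied when the UI is served over a network (for example, when --listen or --share is set); local sessions would consume the unthrottled stream.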

Optimize HF text generation

c8e46e7

oobabooga merged commit 9edb193 into dev Dec 5, 2023

oobabooga deleted the optimize branch December 5, 2023 03:16

oobabooga added a commit that referenced this pull request Dec 6, 2023

Minor bug fix after #4814

6430aca

oobabooga mentioned this pull request Feb 4, 2024

Remove non-HF ExLlamaV2 loader #5431

Merged
