Continuous batching #1333

andreapiso · 2023-07-06T23:58:24Z

Many recent benchmarks indicate that, when serving models behind an API, continuous batching gives higher throughput and lower latency than static batching. Some examples of systems that implement continuous batching:

text-generation-inference from huggingface: https://github.com/huggingface/text-generation-inference
vLLM (which also includes an inference engine): https://github.com/vllm-project/vllm
Ray, starting with the upcoming 2.6 release

In order to enable continuous batching, it is necessary to be able to:

1. Add requests to an existing running batch, if there are enough resources to take them (compared to static batching, where requests need to be submitted all together).
2. Remove a request early from the batch when it reaches the stop token (as opposed to returning all requests at the same time).
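To make the two requirements concrete, here is a toy simulation of the scheduling loop. It is only a sketch under stated assumptions: `Request`, `continuous_batching`, and `max_batch_size` are hypothetical names made up for illustration, each loop iteration stands in for one decoding step, and `length` stands in for the step at which a request would emit its stop token. It does not use or reflect any CTranslate2 API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request; `length` is the number of steps until its stop token."""
    name: str
    length: int
    generated: int = 0


def continuous_batching(requests, max_batch_size):
    """Simulate decoding and return {request name: step at which it finished}."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Requirement 1: admit waiting requests whenever the batch has room.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decoding step for every request currently in the batch.
        step += 1
        for req in running:
            req.generated += 1
        # Requirement 2: remove requests as soon as they reach their stop token.
        still_running = []
        for req in running:
            if req.generated >= req.length:
                finished_at[req.name] = step
            else:
                still_running.append(req)
        running = still_running
    return finished_at
```

With requests of length 2, 5, and 3 and a batch size of 2, the third request takes the slot freed by the first one at step 2 and finishes at step 5. With static batching it would have to wait for the whole first batch (5 steps) and would finish at step 8.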

Is this concept compatible with the CTranslate2 architecture? I am keen to build an inference engine on top of CTranslate2 and would love to hear some thoughts on this before I dive into it.

michaelfeil · 2023-07-07T08:07:18Z

#1317

andreapiso · 2023-07-07T08:22:57Z

@michaelfeil is this related? Yes, vLLM supports continuous batching, but I'm trying to understand whether CTranslate2 can be extended to support it without using vLLM.

guillaumekln · 2023-07-07T08:34:59Z

1. Currently it is not possible to add an entry to a batch that is already running. However, you could buffer incoming requests and batch them together before calling CTranslate2. I think this is already good enough in many situations.
2. This is already possible. There is a callback parameter to get tokens as soon as they are generated, and finished requests are removed from the batch.
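The buffering approach could look roughly like the sketch below. The helper name `collect_batch` and its parameters `max_batch_size` and `max_wait_s` are assumptions made for illustration, not part of CTranslate2. A serving loop would call it repeatedly and pass each returned batch to a single batched call such as `generate_batch`.

```python
import queue
import time


def collect_batch(pending, max_batch_size, max_wait_s):
    """Block for the first request, then gather more until the batch is
    full or the wait budget is spent. `pending` is a queue.Queue."""
    batch = [pending.get()]  # wait until at least one request has arrived
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the wait budget
    return batch
```

The wait budget trades latency for throughput: a longer `max_wait_s` yields fuller batches, but the first request in each batch waits longer before it is processed.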

andreapiso · 2023-07-07T13:55:02Z

Yes, buffering incoming requests and sending them together is what I meant by static batching.

Is 1. not possible today because of a difference in architecture between CT2 and HF Transformers, or is it possible in theory, but the mechanism has not been implemented?

guillaumekln · 2023-07-07T14:59:13Z

CT2 was not designed with this feature in mind, so it is not trivial to implement. But of course it is possible in theory.

guillaumekln added the enhancement (New feature or request) label Aug 3, 2023

guillaumekln pinned this issue Aug 29, 2023

guillaumekln mentioned this issue Sep 5, 2023

Is batch streaming possible with the Text Generation functions? #1423

Closed

nickchomey mentioned this issue Sep 16, 2023

Ideas for better performance #1140

Open

blackpolarz mentioned this issue Dec 1, 2023

Batch decoding SYSTRAN/faster-whisper#588

Closed

minhthuc2502 mentioned this issue Sep 3, 2024

How to early stop an encoding call? #1768

Closed

MahmoudAshraf97 mentioned this issue Sep 18, 2024

Accept variable-length batch prompts for Whisper #1784

Open
