Serve concurrent requests as in vLLM using continuous batching #10170

pvardanis · 2024-11-04T15:00:44Z

I know it is currently possible to start the C++ server and process concurrent requests in parallel, but I cannot find anything similar in the Python bindings that works without spinning up the C++ server and sending concurrent requests to it from Python.

With vLLM I can serve my model in an async FastAPI server like this:

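import uuid
from typing import AsyncGenerator

from vllm import SamplingParams

# Method of the serving class; self.engine, max_tokens and PROMPT_TEMPLATE are defined elsewhere.
async def generate(self, prompt: str) -> AsyncGenerator[str, None]:
    sampling_params = SamplingParams(max_tokens=max_tokens)
    prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)
    stream = await self.engine.add_request(uuid.uuid4().hex, prompt, sampling_params)

    # Each request_output carries the full text so far; yield only the new part.
    cursor = 0
    async for request_output in stream:
        text = request_output.outputs[0].text
        yield text[cursor:]
        cursor = len(text)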

That way I can serve concurrent requests and also take advantage of continuous batching. Is something like this possible with the Python bindings of llama.cpp?
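To make the pattern I am after concrete without depending on vLLM, here is a minimal sketch of the same cursor logic running for two requests on one event loop. `fake_stream`, `stream_deltas`, `collect` and `main` are hypothetical names made up for this illustration; `fake_stream` only imitates a stream that yields the cumulative text, the way `request_output.outputs[0].text` does in the snippet above. It shows the asyncio concurrency and delta streaming only, not continuous batching itself, which would have to happen inside the engine.

```python
import asyncio
from typing import AsyncGenerator, List


async def fake_stream(chunks: List[str]) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for the engine stream: yields the cumulative text so far."""
    text = ""
    for chunk in chunks:
        text += chunk
        await asyncio.sleep(0)  # hand control back to the event loop between steps
        yield text


async def stream_deltas(chunks: List[str]) -> AsyncGenerator[str, None]:
    """Same cursor logic as generate(): emit only the text added since the last step."""
    cursor = 0
    async for text in fake_stream(chunks):
        yield text[cursor:]
        cursor = len(text)


async def collect(chunks: List[str]) -> List[str]:
    """Drain one request's delta stream into a list."""
    return [delta async for delta in stream_deltas(chunks)]


async def main() -> List[List[str]]:
    # Two "requests" are served concurrently; gather returns results in call order.
    return await asyncio.gather(
        collect(["Hel", "lo", " world"]),
        collect(["foo", "bar"]),
    )


results = asyncio.run(main())
print(results)
```

Each inner list holds the deltas for one request, so joining a list gives back that request's full text. What I am looking for is an equivalent of `fake_stream` that is backed by the llama.cpp Python bindings.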

Replies: 0 comments