A few questions regarding the implementation of streaming and batching #494

KimMinSang96 opened this issue Jun 14, 2024 · 0 comments
Streaming: Is there a way to enable streaming? I want to send a query to the server using curl and receive the results token by token, but I couldn't find any code for this part. A sketch of the kind of interaction I am hoping for is below.
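To make the question concrete, here is a minimal sketch of token-by-token streaming over server-sent events. This is purely illustrative: it assumes a FastAPI server, and the `/generate` endpoint and the fake token generator are placeholders of mine, not code from this project.

```python
# Run with: uvicorn streaming_sketch:app  (module name is arbitrary)
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    prompt: str

async def fake_token_stream(prompt: str):
    # Stand-in for the real model: emits one "token" at a time.
    for token in prompt.split():
        await asyncio.sleep(0.1)
        # Server-sent events framing: each chunk is "data: ...\n\n".
        yield f"data: {token}\n\n"

@app.post("/generate")
async def generate(query: Query):
    return StreamingResponse(
        fake_token_stream(query.prompt),
        media_type="text/event-stream",
    )

# Client side (-N disables curl's output buffering, so tokens show up
# as soon as the server sends them):
#
#   curl -N -X POST http://localhost:8000/generate \
#        -H "Content-Type: application/json" \
#        -d '{"prompt": "hello streaming world"}'
```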

Delayed batching or dynamic batching: Is it possible to accumulate incoming queries in a queue for a short window and then process them all at once as a batch? I mean a scheme like the one used in tensorrt-llm, where queries are not executed immediately upon receipt but are collected in a queue for a certain time before being processed together (see the sketch after this paragraph).
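For reference, this is roughly the batching loop I have in mind: requests are parked in a queue, and a background task flushes the queue when either a time window expires or the batch is full. All of the names here (`BATCH_WINDOW_S`, `run_model_batch`, and so on) are hypothetical placeholders, not identifiers from this repository.

```python
import asyncio

BATCH_WINDOW_S = 0.05   # maximum time a request may wait in the queue
MAX_BATCH_SIZE = 8      # flush early once this many requests have arrived

async def run_model_batch(prompts):
    # Placeholder for one batched forward pass over all queued prompts.
    await asyncio.sleep(0.01)
    return [f"response to: {p}" for p in prompts]

async def batching_loop(queue):
    loop = asyncio.get_running_loop()
    while True:
        # Block until the first request arrives, then open the time window.
        batch = [await queue.get()]
        deadline = loop.time() + BATCH_WINDOW_S
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        # One forward pass for the whole batch, then wake each waiter.
        results = await run_model_batch([prompt for prompt, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def submit(queue, prompt):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut            # resolves once the batch is processed

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(batching_loop(queue))
    answers = await asyncio.gather(
        *(submit(queue, f"query {i}") for i in range(20))
    )
    print(answers)

asyncio.run(main())
```

With this scheme, the 20 concurrent calls in `main()` get grouped into a few batches of at most `MAX_BATCH_SIZE` requests instead of 20 separate forward passes.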

Lastly, I am running into an error. After loading 8 model replicas on the GPUs (one per GPU) to improve throughput, sending many queries concurrently within a short period causes requests to hang: the TCP connections remain in the ESTABLISHED state and no responses are returned. It looks like some kind of collision. Is there any way to solve this problem?
I am serving LLaMA-7B on 8 RTX 3090 Ti (24 GB) GPUs. Thanks.
