A few questions regarding the implementation of streaming and batching #494

KimMinSang96 opened this issue Jun 14, 2024 · 0 comments
Streaming: Is there a way to enable streaming? I want to send a query to the server using curl and receive the results token by token, but I couldn't find any code for this part. A sketch of the kind of interaction I am hoping for is below.
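To make the question concrete, here is a minimal sketch of token-by-token streaming over server-sent events. This is purely illustrative: it assumes a FastAPI server, and the `/generate` endpoint and the fake token generator are placeholders of mine, not code from this project.

```python
# Run with: uvicorn streaming_sketch:app  (module name is arbitrary)
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    prompt: str

async def fake_token_stream(prompt: str):
    # Stand-in for the real model: emits one "token" at a time.
    for token in prompt.split():
        await asyncio.sleep(0.1)
        # Server-sent events framing: each chunk is "data: ...\n\n".
        yield f"data: {token}\n\n"

@app.post("/generate")
async def generate(query: Query):
    return StreamingResponse(
        fake_token_stream(query.prompt),
        media_type="text/event-stream",
    )

# Client side (-N disables curl's output buffering, so tokens show up
# as soon as the server sends them):
#
#   curl -N -X POST http://localhost:8000/generate \
#        -H "Content-Type: application/json" \
#        -d '{"prompt": "hello streaming world"}'
```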

Delayed batching or dynamic batching: Is it possible to accumulate incoming queries in a queue for a short window and then process them all at once as a batch? I mean a scheme like the one used in tensorrt-llm, where queries are not executed immediately upon receipt but are collected in a queue for a certain time before being processed together (see the sketch after this paragraph).
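For reference, this is roughly the batching loop I have in mind: requests are parked in a queue, and a background task flushes the queue when either a time window expires or the batch is full. All of the names here (`BATCH_WINDOW_S`, `run_model_batch`, and so on) are hypothetical placeholders, not identifiers from this repository.

```python
import asyncio

BATCH_WINDOW_S = 0.05   # maximum time a request may wait in the queue
MAX_BATCH_SIZE = 8      # flush early once this many requests have arrived

async def run_model_batch(prompts):
    # Placeholder for one batched forward pass over all queued prompts.
    await asyncio.sleep(0.01)
    return [f"response to: {p}" for p in prompts]

async def batching_loop(queue):
    loop = asyncio.get_running_loop()
    while True:
        # Block until the first request arrives, then open the time window.
        batch = [await queue.get()]
        deadline = loop.time() + BATCH_WINDOW_S
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        # One forward pass for the whole batch, then wake each waiter.
        results = await run_model_batch([prompt for prompt, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def submit(queue, prompt):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut            # resolves once the batch is processed

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(batching_loop(queue))
    answers = await asyncio.gather(
        *(submit(queue, f"query {i}") for i in range(20))
    )
    print(answers)

asyncio.run(main())
```

With this scheme, the 20 concurrent calls in `main()` get grouped into a few batches of at most `MAX_BATCH_SIZE` requests instead of 20 separate forward passes.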

Lastly, I am running into an error. After loading 8 model replicas on the GPUs (one per GPU) to improve throughput, sending many queries concurrently within a short period causes requests to hang: the TCP connections remain in the ESTABLISHED state and no responses are returned. It looks like some kind of collision. Is there any way to solve this problem?
I am serving LLaMA-7B on 8 RTX 3090 Ti (24 GB) GPUs. Thanks.
