Streaming: Is there a way to enable streaming? I want to send a query to the server with curl and receive the result token by token, but I couldn't find any code for this part.
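To illustrate what I mean, here is a minimal sketch of the kind of streaming endpoint I am asking about, using FastAPI's `StreamingResponse`. The `generate_stream` generator is hypothetical and only stands in for the model's incremental decoding loop; it is not part of this repo:

```python
# Minimal sketch of token-by-token streaming over HTTP.
# Assumes a hypothetical generate_stream(prompt, max_new_tokens) generator
# that yields decoded tokens one at a time (not an API from this repo).
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    prompt: str
    max_new_tokens: int = 128

def generate_stream(prompt: str, max_new_tokens: int):
    # Placeholder: a real server would yield tokens as the model decodes them.
    for token in ["Hello", ",", " ", "world", "!"]:
        yield token

@app.post("/generate")
def generate(query: Query):
    # Each yielded string is sent to the client as soon as it is produced.
    return StreamingResponse(
        generate_stream(query.prompt, query.max_new_tokens),
        media_type="text/plain",
    )
```

On the client side I would then expect something like `curl -N -X POST http://localhost:8000/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello"}'` to print tokens as they arrive (`-N` disables curl's output buffering).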
Delayed batching or dynamic batching: Is it possible to hold incoming queries in a queue for a short period and then process them together as one batch? I mean something like TensorRT-LLM, where queries are not executed immediately on arrival but are accumulated for a short window before being processed.
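To make the request concrete, here is a rough sketch of the accumulate-then-batch behavior I have in mind. The `model.generate(prompts)` call is a hypothetical batched-inference function, not an API from this repo:

```python
# Rough sketch of delayed (dynamic) batching: requests are collected for up to
# MAX_WAIT_S or until MAX_BATCH_SIZE is reached, then run as a single batch.
import queue
import time

MAX_BATCH_SIZE = 8
MAX_WAIT_S = 0.05  # how long to let requests accumulate before running a batch

# Each queued item is (prompt, reply_queue) so the result can be routed back.
request_queue: queue.Queue = queue.Queue()

def batching_loop(model):
    while True:
        batch = [request_queue.get()]          # block until the first request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        prompts = [prompt for prompt, _ in batch]
        outputs = model.generate(prompts)      # hypothetical: one forward pass for the whole batch
        for (_, reply_q), output in zip(batch, outputs):
            reply_q.put(output)                # hand each result back to its caller

def submit(prompt: str) -> str:
    reply_q: queue.Queue = queue.Queue(maxsize=1)
    request_queue.put((prompt, reply_q))
    return reply_q.get()                       # wait for the batched result
```

The idea is that `batching_loop` runs in a background thread (e.g., `threading.Thread(target=batching_loop, args=(model,), daemon=True).start()`), and each request handler just calls `submit(prompt)` and blocks until its result comes back.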
Lastly, I am running into some errors. After loading 8 model replicas on the GPUs to increase throughput, I have problems when many queries are sent within a short period: the TCP connections stay in the ESTABLISHED state and never return a response, just sitting idle. It looks like some kind of collision. Is there any way to solve this problem?
I am serving LLaMA-7B on 8x RTX 3090 Ti (24 GB) GPUs. Thanks!