Empty responses returned for prompts when using run_server_with_ray.py and batch_size > 1 #137
Comments
Hi Richard, I tested Llama-2 7B with run_server_with_ray.py (--batch_size=32). Instead of sending requests one by one, I used the benchmark script to send 200 requests and got 198 responses back. I verified the responses; they are accurate and correct. Here is one example:
Are you running it on GKE? Can you use the latest code from the main branch and run it with a benchmark test?
I am able to verify that this works if I use the benchmark script. I also verified it on Ray Serve. Why is the behavior different if I send the gRPC requests one by one?
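For context, here is a minimal sketch of the one-by-one client pattern that triggers the problem, as opposed to the benchmark's concurrent requests. The proto module path, stub, method, and request fields (OrchestratorStub, Decode, additional_text, max_tokens) are assumptions based on a typical JetStream-style service definition and may not match this repository exactly.

```python
# Hypothetical sketch of sending prompts one at a time over gRPC.
# The proto module path, OrchestratorStub, Decode, and the DecodeRequest
# fields below are assumptions, not verified against this repository.
import grpc

from jetstream.core.proto import jetstream_pb2, jetstream_pb2_grpc  # assumed path

def send_one_by_one(prompts, address="localhost:9000"):
    with grpc.insecure_channel(address) as channel:
        stub = jetstream_pb2_grpc.OrchestratorStub(channel)
        for prompt in prompts:
            request = jetstream_pb2.DecodeRequest(
                additional_text=prompt,  # assumed field name
                max_tokens=128,          # assumed field name
            )
            # Decode is assumed to stream back partial responses.
            chunks = list(stub.Decode(request))
            print(f"{prompt!r} -> {len(chunks)} response chunks")

if __name__ == "__main__":
    send_one_by_one(["What is JAX?", "What is a TPU?", "What is Ray?"])
```

A run that prints zero chunks for every prompt after the first matches the behavior reported in the issue body below.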
When sending multiple prompts to the server, only the first prompt returns any results. Requests after the first one return only an empty response.
I've tried 3 different ways to bring up the server (all using interleave singlehost on a TPU v4):
No issues.
No issues.
This one reproduces the problem above. Debugging the code further, it seems like the stop token was returned from the model:
This only repros with run_server_with_ray, and only if the batch_size is set to greater than 1.
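As a purely illustrative aid (not this repository's code), the sketch below shows why the per-slot stop-token handling matters once batch_size > 1: if the end-of-sequence decision or the result routing is keyed to the wrong slot, every request after the first can be marked finished before any tokens are collected, which would look exactly like the empty responses described above. STOP_TOKEN, drain_step, and the array shapes are hypothetical.

```python
# Illustrative sketch of per-slot stop-token handling in a batched decode loop.
# Not the repository's actual code; names and shapes are assumptions.
import numpy as np

STOP_TOKEN = 2  # assumed end-of-sequence token id, for illustration only

def drain_step(step_tokens: np.ndarray, results: list, done: list) -> None:
    """Collect one decode step; step_tokens has shape (batch_size,)."""
    for slot, token in enumerate(step_tokens):
        if done[slot]:
            continue
        if int(token) == STOP_TOKEN:   # the check must use this slot's own token
            done[slot] = True
        else:
            results[slot].append(int(token))

# Example: a batch of 3 slots over a few decode steps.
results, done = [[], [], []], [False, False, False]
steps = [
    np.array([5, 7, 9]),
    np.array([STOP_TOKEN, 8, 10]),
    np.array([0, STOP_TOKEN, 11]),
]
for step_tokens in steps:
    drain_step(step_tokens, results, done)
print(results)  # slot 0 stops early, but slots 1 and 2 still return tokens
```

If instead the loop stopped all slots as soon as any slot (or always slot 0) emitted the stop token, only the first request would ever accumulate output, which is consistent with the symptom reported here for batch_size > 1.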