
tensorflow serving batch inference slow !!!! #1483

Closed
sevenold opened this issue Nov 8, 2019 · 13 comments

Assignees: rmothukuru
Labels: needs prio, stale, stat:awaiting response, type:performance

Comments


sevenold commented Nov 8, 2019

Excuse me, how can I solve this slow inference speed problem?
shape:(1, 32, 387, 1)
data time: 0.005219221115112305
post time: 0.24771547317504883
end time: 0.2498164176940918
shape:(2, 32, 387, 1)
data time: 0.0056378841400146484
post time: 0.4651315212249756
end time: 0.4693586826324463

docker run --runtime=nvidia -it --rm -p 8501:8501 \
  -v "$(pwd)/densenet_ctc:/models/docker_test" \
  -e MODEL_NAME=docker_test tensorflow/serving:latest-gpu \
  --tensorflow_intra_op_parallelism=8 \
  --tensorflow_inter_op_parallelism=8 \
  --enable_batching=true \
  --batching_parameters_file=/models/docker_test/batching_parameters.conf

batching_parameters.conf:

num_batch_threads { value: 4 }
batch_timeout_micros { value: 2000 }
max_batch_size { value: 48 }
max_enqueued_batches { value: 48 }

GPU:1080Ti
Thanks.
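
For reference, a minimal sketch of how timings like the ones above are typically taken against the REST endpoint; the host, port, and model name follow the docker command above, and the random input is only a stand-in for the real data (this is not the author's actual client):

# Hypothetical client-side timing sketch, not the reporter's real client.
import json
import time

import numpy as np
import requests

batch = np.random.rand(2, 32, 387, 1).astype(np.float32)

t0 = time.time()
payload = json.dumps({"instances": batch.tolist()})
t1 = time.time()
resp = requests.post("http://localhost:8501/v1/models/docker_test:predict", data=payload)
t2 = time.time()

print("data time:", t1 - t0)  # request serialization
print("post time:", t2 - t1)  # HTTP round trip, including model execution
print("end time:", t2 - t0)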

rmothukuru self-assigned this Nov 8, 2019
rmothukuru added the type:performance label Nov 8, 2019
@rmothukuru

@sevenold,
Can you please let us know what the GPU utilization is during serving? The problem might be low GPU utilization.

Can you please try running the container with the parameters below and let us know if that resolves your issue. Thanks!

--grpc_channel_arguments=grpc.max_concurrent_streams=1000
--per_process_gpu_memory_fraction=0.7
--enable_batching=true
--max_batch_size=10
--batch_timeout_micros=1000
--max_enqueued_batches=1000
--num_batch_threads=6
--batching_parameters_file=/models/flow2_batching.config
--tensorflow_session_parallelism=2

For more information, please refer to #1440.


sevenold commented Nov 8, 2019

@rmothukuru
I tried running the container with the parameters below, but got the same result.


docker run --runtime=nvidia -it --rm -p 8501:8501 \
  -v "$(pwd)/densenet_ctc:/models/docker_test" \
  -e MODEL_NAME=docker_test tensorflow/serving:latest-gpu \
  --grpc_channel_arguments=grpc.max_concurrent_streams=1000 \
  --per_process_gpu_memory_fraction=0.7 \
  --enable_batching=true \
  --max_batch_size=128 \
  --batch_timeout_micros=1000 \
  --max_enqueued_batches=1000 \
  --num_batch_threads=8 \
  --batching_parameters_file=/models/docker_test/batching_parameters.conf \
  --tensorflow_session_parallelism=2


[screenshot: GPU utilization]
It also shows low GPU utilization.


@rmothukuru

@sevenold,
Can you please confirm that you have gone through issue #1440 and that the issue still persists?
If so, can you please share your model so that we can reproduce the issue on our side. Thanks!


sevenold commented Nov 11, 2019

@rmothukuru Thanks.
google drive
This is my model and client.

@sevenold

@rmothukuru
I tested my other models, such as a verification-code (CAPTCHA) recognition model, with the same parameters, and GPU prediction works normally there. Thanks!

@leo-XUKANG

Maybe you can try the gRPC channel.

@sevenold

> Maybe you can try the gRPC channel.

I tried, but got the same result.

@RainZhang1990

Same question. It seems like TF Serving predicts images serially even when I post multiple images at one time.

@misterpeddy

What happens when you load up the model with TF directly? Do you get significantly better inference latency? If your TF runtime requires X time to do a forward pass on your model for a batch of examples, X becomes a lower bound for your inference latency with TF Serving.
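
For comparison, a rough sketch of the measurement being suggested here, i.e. timing a forward pass with the TF runtime directly; the SavedModel path, signature lookup, and batch shape are assumptions based on this thread:

# Hypothetical sketch: measure a direct TF forward pass to establish the
# lower bound described above. The path and shapes are assumptions.
import time

import numpy as np
import tensorflow as tf

model = tf.saved_model.load("./densenet_ctc/1")   # assumed export path
infer = model.signatures["serving_default"]
input_name = list(infer.structured_input_signature[1].keys())[0]

batch = tf.constant(np.random.rand(2, 32, 387, 1).astype(np.float32))

infer(**{input_name: batch})                       # warm-up run (excludes tracing)

t0 = time.time()
infer(**{input_name: batch})
print("TF runtime forward pass:", time.time() - t0)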


ganler commented Apr 2, 2020

I found that serialization (of FP16 data) adds significant overhead in the gRPC client API, and this heavily reduces QPS. In my case, the data being transferred is 3x224x244,
and the serialization cost is two times the server processing time on the ResNet50 model.
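
A sketch of how that split can be measured with a gRPC client; the host, port, model name, signature, input key, and the FP16 tensor shape are all assumptions standing in for the real setup:

# Hypothetical sketch: separate client-side serialization time from the RPC itself.
import time

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

image = np.random.rand(1, 3, 224, 224).astype(np.float16)  # stand-in data

t0 = time.time()
request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet50"                        # assumed model name
request.model_spec.signature_name = "serving_default"
request.inputs["input"].CopyFrom(tf.make_tensor_proto(image))
t1 = time.time()
stub.Predict(request, timeout=10.0)
t2 = time.time()

print("client-side serialization:", t1 - t0)
print("rpc round trip (includes server processing):", t2 - t1)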


owenljn commented Sep 15, 2021

Is this issue solved?
I'm having the same problem when serving an OpenNMT TensorFlow model. I have configured --rest_api_num_threads=1000 and --grpc_channel_arguments=grpc.max_concurrent_streams=1000,
but they just won't work somehow; the TensorFlow server keeps saying gRPC resource exhausted, and I can't send more than 15 concurrent requests.
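
One client-side setting worth checking when gRPC reports RESOURCE_EXHAUSTED is the message size limit; a sketch with illustrative (assumed) values, which may or may not be the limit being hit here:

# Hypothetical sketch: raise client-side gRPC message size limits, a common
# source of RESOURCE_EXHAUSTED errors. The 64 MB values are assumptions.
import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

options = [
    ("grpc.max_send_message_length", 64 * 1024 * 1024),
    ("grpc.max_receive_message_length", 64 * 1024 * 1024),
]
channel = grpc.insecure_channel("localhost:8500", options=options)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)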

@singhniraj08

@oohx,

Could you please provide some more information for us to debug this issue?
We would like to understand how the same model with the same batched data performs in TensorFlow. Could you please share the latency of your model doing inference in the TF runtime and of the same model doing inference in TF Serving?

If your TF runtime requires X time to do a forward pass on your model for a batch of examples, X becomes a lower bound for your inference latency with TF Serving. Also, please refer to the performance guide.

Thank you!

singhniraj08 added the stale label Feb 28, 2023
@github-actions

This issue was closed due to lack of activity after being marked stale for the past 14 days.
