
tf serving performance is so slow #1989

Closed
liumilan opened this issue Mar 21, 2022 · 14 comments

Labels: stale (to be closed automatically if no activity), stat:awaiting response, type:performance (Performance Issue)

Comments

@liumilan

I train a recommendation NN model offline and then serve predictions with TF Serving on a CPU-only online machine. I have allocated 8 cores, and prediction is slow: more than 0.4% of requests take 100 ms or more. The request batch size is 100. The model has 167 one-hot features and 3 fully-connected layers. CPU usage is also low, only about 20%.
How can I analyze the serving bottleneck, and is it possible to reduce the tail latency by adjusting some parameters?
I have tried many of the suggestions in https://www.tensorflow.org/tfx/serving/performance, but they did not improve performance. I suspect that, because there are so many one-hot features, much of the time is spent looking up the hashed feature embeddings.
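As a starting point, a small client-side script can confirm how much of the tail comes from the serving path itself rather than from the network. This is only a sketch; the endpoint, model name, and input layout below are hypothetical placeholders:

import json
import time

import numpy as np
import requests

# Hypothetical REST endpoint and input layout: 100 rows x 167 one-hot feature ids.
URL = "http://localhost:8501/v1/models/recommend_model:predict"
batch = np.random.randint(0, 1000, size=(100, 167)).tolist()
payload = json.dumps({"inputs": batch})

latencies_ms = []
for _ in range(500):
    start = time.perf_counter()
    requests.post(URL, data=payload).raise_for_status()
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

# The issue reports >0.4% of requests above 100 ms, so look at the far tail.
for p in (50, 99, 99.6):
    print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")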

@pindinagesh pindinagesh self-assigned this Mar 21, 2022
@pindinagesh pindinagesh added the type:performance Performance Issue label Mar 21, 2022
@pindinagesh

Hi @liumilan

Can you take a look at the workaround proposed in this thread and see if it helps resolve your issue? You can also refer to link1 and link2, which discuss similar problems. Thanks!

@salliewalecka

salliewalecka commented Mar 22, 2022

Hi @liumilan

One gotcha I ran into is that the CPU usage of the TF Serving container is rather spiky and does not show up in 1-minute aggregates (so it can use 100%+ CPU but show under 50% on average in some cases). I'm not sure what your serving environment is, but if it is Kubernetes I'd recommend plotting CPU throttling to make sure you are not running into that (there is a helpful video on CPU throttling). Increasing limits will allow your application to burst into spikes. In addition, you can look into serving your application with more CPU (though that's costly since you are already at only 20% CPU usage).

Apart from that, you can look into attaching TensorBoard to find costly operations -- it is fairly easy to set up. I've not found any other parameters that have helped much with this problem, only changing resources and changing batch size.
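For the TensorBoard route, one way to capture a trace from a running TF Serving instance is the TensorFlow profiler client pointed at the server's gRPC port. A minimal sketch, assuming the server supports profiling, is reachable at localhost:8500, and a recent TF 2.x is installed on the client machine:

import tensorflow as tf

# Capture a 2-second trace from the running model server (the address is an assumption);
# send prediction traffic in parallel so the trace contains real requests.
tf.profiler.experimental.client.trace(
    service_addr="grpc://localhost:8500",
    logdir="/tmp/tfserving_profile",
    duration_ms=2000,
)

# Inspect the costly ops afterwards with:
#   tensorboard --logdir /tmp/tfserving_profile   (open the Profile tab)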

@liumilan
Author

liumilan commented Mar 28, 2022

> Hi @liumilan
>
> One gotcha I ran into is that the CPU usage of the TF Serving container is rather spiky and does not show up in 1-minute aggregates (so it can use 100%+ CPU but show under 50% on average in some cases). I'm not sure what your serving environment is, but if it is Kubernetes I'd recommend plotting CPU throttling to make sure you are not running into that (there is a helpful video on CPU throttling). Increasing limits will allow your application to burst into spikes. In addition, you can look into serving your application with more CPU (though that's costly since you are already at only 20% CPU usage).
>
> Apart from that, you can look into attaching TensorBoard to find costly operations -- it is fairly easy to set up. I've not found any other parameters that have helped much with this problem, only changing resources and changing batch size.

I have attached TensorBoard offline to look at costly operations, and I found that most of the time is spent looking up embedding features. Is it possible to reduce this time? @salliewalecka

@salliewalecka

Hey, I don't have any more tips for you if the embedding feature lookup is the bottleneck. Sorry!

@liumilan
Author

> Hi @liumilan
>
> Can you take a look at the workaround proposed in this thread and see if it helps resolve your issue? You can also refer to link1 and link2, which discuss similar problems. Thanks!

I don't think it is the same issue. My bottleneck is that the embedding lookup costs a lot of time, according to the TensorFlow timeline.
@pindinagesh

@liumilan
Author

timeline-1.txt
@pindinagesh here is my timeline, could you help check it? Just rename it to timeline-1.json and open it in Chrome.

@liumilan
Author

liumilan commented Apr 5, 2022

Who can help check this timeline?

@vscv

vscv commented Apr 6, 2022

In fact, other applications have a similar performance issue: #1991

@liumilan
Author

liumilan commented Apr 6, 2022

> I also have the same low-performance issue. I guess it mainly comes from two parts:
>
>   1. It takes time to convert the image into a JSON payload and POST it.
>   2. TF Serving itself is delayed (several POSTs had already been made in advance as a warm-up).
>
> In my POST test, the remote side (MBP + WiFi) takes 16 ~ 20 seconds to print res.json, while the local side takes 5 ~ 7 seconds. Also, I observed GPU usage, and it only ran (~70%) for less than a second during the entire POST.
>
> # 1024x1024x3 image to JSON and POST
> import sys, json, requests
> import numpy as np
> import PIL.Image
>
> image = PIL.Image.open(sys.argv[1])
> image_np = np.array(image)  # convert the PIL image to an array so it can be serialized
> payload = {"inputs": [image_np.tolist()]}
> res = requests.request("POST", "http://2444.333.222.111:8501/v1/models/maskrcnn:predict", data=json.dumps(payload))
> print(res.json())

My scenario is recommendation, not CV.

@liumilan
Author

@pindinagesh @christisg could you help check timeline.json?

@singhniraj08 singhniraj08 assigned nniuzft and unassigned christisg Feb 17, 2023
@singhniraj08 singhniraj08 self-assigned this Apr 6, 2023
@singhniraj08

@liumilan,

Can you please compare the time taken to generate predictions using the TensorFlow runtime directly with the time taken through TensorFlow Serving? Under the hood, TensorFlow Serving uses the TensorFlow runtime to do the actual inference on your requests, which means the average latency of serving a request with TensorFlow Serving is usually at least that of doing inference directly with TensorFlow.
That comparison would help us understand whether the real issue is with TensorFlow Serving or with the model. If embedding lookup is your bottleneck, I would suggest re-designing your model with inference latency as a design constraint in mind.
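One way to get the direct-runtime number is to load the exported SavedModel and time its serving signature. A rough sketch, where the SavedModel path, the signature input key "inputs", and the input shape are all hypothetical placeholders:

import time

import numpy as np
import tensorflow as tf

# Load the exported model directly (path is a placeholder).
model = tf.saved_model.load("/models/recommend_model/1")
infer = model.signatures["serving_default"]

# Hypothetical batch of 100 rows x 167 feature ids; use the signature's real input key.
batch = tf.constant(np.random.randint(0, 1000, size=(100, 167)), dtype=tf.int64)

infer(inputs=batch)  # warm-up call

start = time.perf_counter()
for _ in range(100):
    infer(inputs=batch)
print("direct TF runtime: %.2f ms/request" % ((time.perf_counter() - start) * 10))

Comparing this number against the client-side latency measured through TF Serving shows how much of the cost comes from serialization, networking, and the server itself rather than the model.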

If the tail latency (the time taken by TensorFlow Serving itself to do inference) turns out to be high, you can try the gRPC API surface, which is slightly more performant. You can also experiment with command-line flags (most notably tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism) to find the right configuration for your specific workload and environment.
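For the gRPC route, a minimal client sketch (it assumes the tensorflow-serving-api pip package is installed; the model name and signature input key are placeholders):

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# TF Serving's gRPC port (8500 by default when started with --port=8500).
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "recommend_model"            # placeholder model name
request.model_spec.signature_name = "serving_default"

batch = np.random.randint(0, 1000, size=(100, 167))
request.inputs["inputs"].CopyFrom(tf.make_tensor_proto(batch))  # "inputs" = signature input key

response = stub.Predict(request, 10.0)  # 10-second deadline
print(list(response.outputs.keys()))

The tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism settings mentioned above are startup flags of tensorflow_model_server, so changing them requires restarting the server.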

Thank you!

@github-actions

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Apr 14, 2023
@github-actions

This issue was closed due to lack of activity after being marked stale for the past 7 days.
