Tensorflow (TF) Serving on Multi-GPU box #311
TF Serving only executes the graph it loads. If your graph uses multiple GPUs, TF Serving will use them; if your graph only uses one GPU (the most common case), then it will only run on one GPU in TF Serving. Probably your best solution is to build a script that loads your graph once per GPU, uses some CPU-side code to split each batch across the per-GPU copies, and then exports the whole graph with multi-GPU support. The only other solution I am aware of would be to replicate the same graph onto every GPU inside TF Serving itself; I guess that could be done by modifying the serving code, but I don't think it will be supported anytime soon, so you would have to do it yourself.
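For anyone looking for a starting point, here is a minimal sketch of that idea, assuming a TF 1.x graph-mode model whose variables are created with `tf.get_variable`; the names `model_fn` and `NUM_GPUS` are illustrative placeholders, not anything TF Serving provides:

```python
import tensorflow as tf  # TF 1.x graph-mode sketch

NUM_GPUS = 4  # assumption: set to the number of GPUs on the box

def build_multi_gpu_inference(model_fn, batch_input):
    """Split a batch on the CPU, run one copy of the model per GPU,
    and concatenate the per-GPU outputs before exporting the whole graph."""
    with tf.device("/cpu:0"):
        # Requires the batch size to be divisible by NUM_GPUS.
        shards = tf.split(batch_input, NUM_GPUS, axis=0)
    outputs = []
    for i in range(NUM_GPUS):
        # Reuse the same variables in every tower so all GPUs share one set of weights.
        with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
            outputs.append(model_fn(shards[i]))
    with tf.device("/cpu:0"):
        return tf.concat(outputs, axis=0)
```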
Using the CUDA_VISIBLE_DEVICES environment variable you can pin Model Server processes to a specific GPU. Run each of them on a separate port and throw a load balancer in front. If you are handling any significant load you probably want to run the model server with --enable_batching.
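As an illustration only (the binary path, model name, model path, and ports below are placeholders), launching one model server process per GPU could look like this:

```python
import os
import subprocess

MODEL_SERVER = "bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server"
NUM_GPUS = 4
BASE_PORT = 9000

procs = []
for gpu_id in range(NUM_GPUS):
    # Each server process only sees its own GPU and listens on its own port.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        [MODEL_SERVER,
         "--port=%d" % (BASE_PORT + gpu_id),
         "--model_name=mymodel",
         "--model_base_path=/models/mymodel",
         "--enable_batching"],
        env=env))

for p in procs:
    p.wait()
```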
@jlertle: Thanks for the comment. But even with that, on a 4-GPU box you would have to run 4 servers, which does not seem like the most efficient way to do it. You also have to run them on 4 different ports, which in turn adds routing complexity.
@Immexxx I think there are not as many resources invested in the Serving and ecosystem repositories compared to TF. It is understandable, since these are somewhat in conflict with selling managed services.
Guessing I could close this issue based on the above responses, but it would be cool for the system to "intelligently" figure out the load (for example, a graph requiring only one GPU) and scale it based on available resources: if multiple GPUs are available, replicate the graph and serve it. This could be a flag when launching TF Serving. In summary, I will leave this open for now in case someone (including me) is looking for an interesting project. In general, TF's handling of GPUs (grabbing all the memory of all GPUs and "locking" them up while using only a subset for compute) does not seem like the most elegant way of doing this. A flag for "auto-replicate and serve" would be nice.
Let's re-open this; with an 8-GPU machine, it would be really helpful if the server simply alternated requests across the N available GPUs.
+1. This would be incredibly useful. Currently I have a model that I am trying to run on a
In the end I simply used client-side load balancing and had a separate TensorFlow Serving server for each GPU.
@zacharynevin, me too. TensorFlow Serving is wasting multiple GPUs.
Another +1 here for the exact use case as above. Running this on a p2 instance with extra GPUs would be an incredibly easy way to scale up, but right now it requires an instance per GPU.
A natural way to do this would be to use the Session option visible_device_list to create a session for each GPU and then serve the same model on each GPU. Unfortunately, due to a significant TensorFlow issue (described in tensorflow/tensorflow#8136 and many related bugs), it seems that you cannot have multiple Sessions in the same process with different visible_device_list settings. There doesn't seem to be any traction on getting it fixed, as all the related bugs have been closed without any code changes.
I placed a comment on one of the many closed tickets referred to by @deadeyegoodwin. The comments on closing those tickets speak as if the functionality must be this way, yet this works absolutely fine in TensorFlow 1.3! We have an app where we call list_local_devices() to get the list of available GPUs. We then create a Session per GPU, each running the same graph, and use visible_device_list to make each Session see the GPU assigned to it as GPU 0. We have this running in live production code with zero problems.
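For reference, a rough sketch of that pattern using the TF 1.x API (the `export_dir` argument is a placeholder, and as noted in the next comment this crashes with a TfGpuId check failure on some later TF versions):

```python
import tensorflow as tf  # TF 1.x
from tensorflow.python.client import device_lib

def load_session_per_gpu(export_dir):
    """Create one Session per GPU; visible_device_list makes each Session
    see its assigned GPU as GPU 0."""
    gpus = [d for d in device_lib.list_local_devices() if d.device_type == "GPU"]
    sessions = []
    for i, _ in enumerate(gpus):
        config = tf.ConfigProto(
            gpu_options=tf.GPUOptions(visible_device_list=str(i), allow_growth=True))
        graph = tf.Graph()
        sess = tf.Session(graph=graph, config=config)
        with graph.as_default():
            # Load the same SavedModel into every per-GPU Session.
            tf.saved_model.loader.load(
                sess, [tf.saved_model.tag_constants.SERVING], export_dir)
        sessions.append(sess)
    return sessions
```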
The problem is NOT with list_local_devices! I ran my code once, got the result we use (which is just the string description), then commented out the call to list_local_devices, and I still get the same crash with the same error: F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0) Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1. I am going to try some earlier versions of TensorFlow to see if I can identify where this went bad.
This issue still appears on the latest versions. I'm using an AWS g3.8xlarge instance, which has 2 GPUs. TF Serving is able to detect both GPUs and initialise them, but while running the model it only drives one GPU to full utilisation. We are on version 1.7; even though the client sends up to 32 requests in parallel, the model server only uses the first GPU (see the screenshot from nvidia-smi). Seeing that this ticket has been open for quite some time, is the external load balancer solution (that @zacharynevin suggested) the only solution at the moment?
I hit the same error when using multiple GPUs in TF 1.6: Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1.
@jlertle @zacharynevin, hi, I would like to ask what commands you were using to run a separate TensorFlow Serving server for each GPU. I have a machine with two 1080 Tis. My TF-Serving is able to correctly identify both of the GPUs when the
The first command was running fine, but the second one hits an error that says
Any ideas or suggestions? Thanks in advance!
Hi @sugartom, can you post the output of nvidia-smi?
@zacharynevin, thanks for your reply. My output of nvidia-smi (without TF-Serving running) is: [nvidia-smi table not captured]. Please let me know if you need any more information. Thanks!
@sugartom the commands and the output of nvidia-smi look good. I'm not sure what the issue could be at this time.
@jlertle, thanks for your reply. May I ask what's your TF version? I am using v1.2...
I'm currently on v1.6 but have used this method since ~v0.8. Have you tried running only on the second GPU? Perhaps try this: export CUDA_VISIBLE_DEVICES=1 && bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model
Yes, already tried that. My TF-Serving runs perfectly on either gpu:0 or gpu:1 alone, and that's why it confused me so much that running two TF-Serving instances didn't work...
@Immexxx, hi, thanks for your reply!
Anyway, thanks again for your suggestions :-)
Is this still an issue?
Do you mean that when a model is trained with model parallelism, it can then be served with multiple GPUs?
@Immexxx
According to Figure 4 of an ICML 2017 paper about device placement, utilising 2 or 4 GPUs only slightly reduces latency (from 0.1 to ~0.07) compared with a single GPU in model-parallel training. Is the goal of utilising multiple GPUs for one request to reduce this minimum latency? Thanks
On the TensorRT GitHub page they have mentioned "Multi-GPU support. TRTIS can distribute inferencing across all system GPUs." Is this already possible with TF Serving?
How can I use a specific GPU to run TensorFlow Serving with Docker? I don't want to take up all GPUs when running TF Serving. Does anybody know? In addition, my model is written with tf.keras.
@xuzheyuan624 could you try setting the environment variable CUDA_VISIBLE_DEVICES and see if it works?
@aaroey I tried like this:
You need to use the sample command.
If you use a config file for TensorFlow Serving, you can use the following command while running the Docker container.
Until TensorFlow Serving incorporates multi-GPU load balancing, the best approach to multi-GPU inference with TensorFlow Serving is to run a dockerised TensorFlow Serving container per GPU (#311 (comment)) and then write your own load-balancing logic, which holds a list of all the url-port combinations and dispatches requests accordingly.
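A minimal sketch of such client-side dispatch, assuming the tensorflow-serving-api package is installed and using placeholder endpoint addresses, model name, and input key:

```python
import itertools
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# One TF Serving instance per GPU; addresses and ports are placeholders.
ENDPOINTS = ["localhost:8500", "localhost:8501", "localhost:8502", "localhost:8503"]

stubs = [prediction_service_pb2_grpc.PredictionServiceStub(grpc.insecure_channel(e))
         for e in ENDPOINTS]
round_robin = itertools.cycle(stubs)

def predict(batch, model_name="mymodel", input_key="inputs", timeout_secs=10.0):
    """Send each request to the next endpoint in round-robin order."""
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs[input_key].CopyFrom(tf.make_tensor_proto(batch))
    return next(round_robin).Predict(request, timeout_secs)
```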
@t27 thanks! It's very useful!
Has anyone run into a throughput drop when using load balancing over multiple TF Serving instances? For a single instance, with batching tuned, it can reach 500 qps, but with 8 instances it can't reach 500 * 8 = 4000 qps; even worse, it can't even reach 500 qps. I know batching is a main cause, but so far I don't know why, and I haven't found a better solution to make full use of multiple GPU devices.
@troycheng
If we only serve one model with one model version on a multi-GPU server, we have implemented the functionality to automatically start multiple Sessions to load the model, each bound to a different GPU device. This can utilize all the GPU devices for one model and rebalance the inference requests with simple strategies like round-robin. However, the default Servable implementation in TensorFlow Serving manages only one Session and uses the TensorFlow C API to load the SavedModel, so we had to implement a new Servable or SourceAdapter to support this. There are some issues with multiple models or multiple model versions if we allocate all GPU resources to one model version; that is related to the usage and design of this project. We are willing to contribute what we have done once there is a clear design for multi-GPU support.
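As a rough Python analogue of that idea (the implementation described above lives in a custom Servable, not in Python; the `sessions` list and tensor names here are illustrative and would come from the SavedModel's signature in practice):

```python
import itertools

def make_round_robin_runner(sessions, output_name="outputs:0", input_name="inputs:0"):
    """Rebalance inference requests across GPU-bound Sessions, round-robin.

    `sessions` would be per-GPU Sessions loaded from the same SavedModel,
    e.g. as in the visible_device_list sketch earlier in this thread."""
    cycle = itertools.cycle(sessions)
    def run(batch):
        sess = next(cycle)
        return sess.run(output_name, feed_dict={input_name: batch})
    return run
```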
Thank you for the information. Is there already a demo or related code in this project? Thank you very much!
Closing this issue. Please feel free to reopen if this still exists. Thanks.
This is still an issue.
I am facing the same problem on TF 2.11. I have two GPUs, but it only picks one during serving.
The issue is for a single model (not multi-model).
Is this dependent on how the model is loaded and exported?
For inference (and not training): is there an example of a saved model being loaded onto multiple GPUs (with a single CPU), where the CPU splits the load among multiple GPUs instead of using only one GPU?