Maybe the cheapest cloud inference option for Yolov5 (AWS Neuron inf1 instance) #2643
@Ownmarc that's really interesting! BTW if the AWS inference instances can exploit FP16 inference then they should benefit from the new autocast PR #2641 I made today for PyTorch Hub models. This seemed to cut about 1/3 off the inference time on a Colab T4. The results.print() method also now displays profiling results. I think up to now the best performance:cost ratio I've seen is from T4 instances, but I have not tried the inf1.xlarge instances yet. Is a 4-GPU/neuron instance the smallest they get?

EDIT: I should clarify, the best performance:cost from enterprise GPUs hosted on the large cloud providers is probably from T4s. The best overall GPU performance per cost would probably be from 1080ti's on consumer clouds like vast.ai.
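For context, the autocast pattern described above can be sketched roughly like this. This is a minimal illustration, not the actual PR code: `infer_fp16` is a hypothetical wrapper name, and it assumes a standard PyTorch model whose ops are autocast-eligible.

```python
import torch

def infer_fp16(model, img):
    # Hypothetical sketch: under autocast, eligible ops run in FP16 on CUDA
    # devices, which is where the reported ~1/3 speedup on a T4 comes from.
    # With enabled=False (e.g. on a CPU-only machine) this is a no-op.
    use_amp = torch.cuda.is_available()
    with torch.no_grad(), torch.cuda.amp.autocast(enabled=use_amp):
        return model(img)
```

On hardware without FP16 tensor cores the wrapper falls back to ordinary FP32 inference, so it is safe to apply unconditionally.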
@glenn-jocher from what AWS says, Amazon EC2 Inf1 instances based on AWS Inferentia chips deliver up to 30% higher throughput and up to 45% lower cost per inference than Amazon EC2 G4 instances. Those G4 instances use the T4 GPU. That is most likely assuming the model fully compiles for their chip, which is currently not the case for yolov5. Yes, the inf1.xlarge is the smallest they offer with this Inferentia chip.
@Ownmarc that's interesting! AWS seems to like offering larger chunks of stuff than GCP. The AWS P4 instances are only available as full 8x A100 machines on AWS, whereas on GCP you can get a smaller instance with 1, 2, 4 or 8 A100s. Are there any special steps you need to get started with the Inf1 instances? BTW I forgot to tell you, when timing CUDA ops it's important to make calls to torch.cuda.synchronize() so you don't get incorrect times. We have a helper function that can replace time.time() that does this here: Lines 89 to 94 in 2bf34f5
This function is always used when profiling (detect.py, PyTorch Hub, test.py, etc.).
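The helper referenced above is along these lines. This is a sketch of the idea rather than the exact repository code; the guarded import is added here so the snippet also runs on machines without PyTorch.

```python
import time

try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None

def time_synchronized():
    # CUDA kernels launch asynchronously, so time.time() alone would only
    # measure launch overhead. Synchronizing first makes the timestamp
    # reflect completed GPU work, giving accurate inference timings.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()
```

Typical usage is `t0 = time_synchronized(); model(img); dt = time_synchronized() - t0`.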
Yea, so you need to compile the model for their Inferentia chip; it's pretty easy with the common frameworks (torch/tensorflow/mxnet), and there are a couple of tutorials on how to do so. And yes, I know there are some extra steps required to get the real time taken by the inference. When they release a version that gets the model to fully compile, I'll give it a shot again and try to get better metrics! :)
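The compile step mentioned above looks roughly like the following. This is a hedged sketch, assuming the aws-neuron SDK (`torch_neuron`) is installed on the inf1 instance; `compile_model` and the output path are hypothetical names, and the snippet falls back to plain TorchScript tracing when the SDK is absent so it can be tried off-instance.

```python
import torch

def compile_model(model, example_input, out_path="model_neuron.pt"):
    # Trace the model for Inferentia when torch-neuron is available;
    # otherwise fall back to ordinary TorchScript tracing (assumption:
    # the model is traceable with a fixed input shape).
    model.eval()
    try:
        import torch_neuron
        traced = torch_neuron.trace(model, example_inputs=[example_input])
    except ImportError:
        traced = torch.jit.trace(model, example_input)
    traced.save(out_path)  # the saved artifact is what you load for serving
    return traced
```

Ops that the Neuron compiler cannot handle run on the host CPU, which is why a partially compiled yolov5 still works, just slower than a fully compiled one would.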
@Ownmarc interesting stuff. I responded over at aws-neuron/aws-neuron-sdk#253 to let them know I'm available to offer any help I can.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I've been trying the inf1 EC2 instance from AWS with their own Inferentia chips.
https://aws.amazon.com/ec2/instance-types/inf1/
The Yolov5 model doesn't fully compile yet for their accelerated inference chip, but it is still working pretty well. This issue on their repo will be tracking the support of Yolov5 : aws-neuron/aws-neuron-sdk#253
I have done some really basic speed tests (the zidane image run 10 times, batch size = 1) to compare my 1080ti against one NeuronCore of this Inferentia chip (there are 4 NeuronCores per chip, so if the inference job can be parallelized, the inference time can be divided by 4, as each NeuronCore can load its own yolov5 model and infer independently).
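The per-core parallelization idea above can be sketched with a plain worker pool. This is an illustration of the dispatch pattern only; `parallel_inference` is a hypothetical name, and the per-core models stand in for four independently loaded yolov5 instances, one per NeuronCore.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_inference(models, images):
    # Round-robin images across one model instance per NeuronCore.
    # Because each core runs independently, batch-1 throughput scales
    # up to len(models)-fold for an embarrassingly parallel workload.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [
            pool.submit(models[i % len(models)], img)
            for i, img in enumerate(images)
        ]
        return [f.result() for f in futures]
```

In practice the four models would each be pinned to a different NeuronCore (e.g. via the SDK's core-placement environment variables), so the pool size should match the core count.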
I'll keep this updated as they update their issue about yolov5 support, and maybe I'll make a simple tutorial showing how easy it is to compile and run Yolov5 on their chip.