YOLOv5 AWS Inferentia Inplace compatibility updates #2953
Conversation
👋 Hello @jluntamazon, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:
- ✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master, an automatic GitHub Actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:

```bash
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
```
- ✅ Verify all Continuous Integration (CI) checks are passing.
- ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee
@jluntamazon thanks for the PR! There seems to be a recent change in GitHub Actions that is preventing automatic tests on each new commit for first-time contributors, but you can effectively run the same suite of tests here (exit code zero passes). We don't have tests currently that compare training/inference output values, but I'll review this a little further myself to verify.

```bash
rm -rf runs  # remove runs/
for m in yolov5s; do  # models
  python train.py --weights $m.pt --epochs 3 --img 320 --device 0  # train pretrained
  python train.py --weights '' --cfg $m.yaml --epochs 3 --img 320 --device 0  # train scratch
  for d in 0 cpu; do  # devices
    python detect.py --weights $m.pt --device $d  # detect official
    python detect.py --weights runs/train/exp/weights/best.pt --device $d  # detect custom
    python test.py --weights $m.pt --device $d  # test official
    python test.py --weights runs/train/exp/weights/best.pt --device $d  # test custom
  done
  python hubconf.py  # hub
  python models/yolo.py --cfg $m.yaml  # inspect
  python models/export.py --weights $m.pt --img 640 --batch 1  # export
done
```
@jluntamazon everything seems ok at first glance. I had an idea: would it be possible to clone the input when `inplace=False`? For example:

```python
import torch.nn as nn

class Module(nn.Module):
    def __init__(self, inplace=True):
        super().__init__()
        self.inplace = inplace

    def forward(self, x):
        if not self.inplace:
            x = x.clone()  # work on a copy so the caller's tensor is untouched
        x *= 2  # common code regardless of inplace
        return x
```

EDIT: @jluntamazon just realized I can't test this myself since the problem we're trying to patch is on your side. Do you think you could try the above?
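As a quick sanity check of the pattern in the sketch above (hypothetical usage, assuming `Module` is defined as written):

```python
import torch

m = Module(inplace=False)  # out-of-place: input is cloned before mutation
x = torch.ones(4)
y = m(x)
print(x)  # tensor([1., 1., 1., 1.]) - caller's tensor untouched
print(y)  # tensor([2., 2., 2., 2.])
```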
@jluntamazon ok I've got 3 profiling results here on CPU (can't profile GPU unfortunately). The proposed PR is about 2.5x slower than master:
- master: 162 ms
- PR: 400 ms
- .clone(): 366 ms
Thanks for the quick feedback!
The key issue is that the Neuron compilation process currently doesn't support in-place assignment, so unfortunately a clone does not solve the issue. To give some additional background: on Neuron we compile the graph into an optimized format that is distinct from the original torch operations/graph. We trace the model and then send that graph to our optimizing compiler. What we get in the end is a one-operation torch graph (unless we have unsupported operators) where the fused computation is performed completely on chip. This means we cannot properly compare on a per-op basis, due to compiler optimization and the fact that torch operators are no longer run as-is.
For the above tests, I'm assuming you were using …
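To make the tracing/compilation flow described above concrete, a minimal sketch using the torch-neuron package might look like the following. The weights file, input size, and the `inplace` keyword on `attempt_load` are illustrative assumptions rather than the exact workflow used in this thread:

```python
import torch
import torch_neuron  # AWS Neuron SDK PyTorch integration (torch-neuron package)

from models.experimental import attempt_load  # YOLOv5 helper (run from the repo root)

# Load the model with in-place slice assignment disabled so the whole graph
# can be captured by the Neuron compiler (the inplace flag added in this PR)
model = attempt_load('yolov5s.pt', map_location='cpu', inplace=False)
model.eval()

example = torch.rand([1, 3, 480, 480])

# Trace the model and hand the graph to the Neuron optimizing compiler;
# unsupported operators fall back to running on CPU
neuron_model = torch.neuron.trace(model, example_inputs=[example])

# Save the compiled model; it can be reloaded later with torch.jit.load()
neuron_model.save('model.pth')
```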
Here are initial performance results using the concatenation method. These were collected with the torch profiler:

```python
import torch
import torch_neuron
import torch.autograd.profiler as profiler

# Load the compiled model
model = torch.jit.load('model.pth')

# Create a sample image
sample = torch.rand([1, 3, 480, 480])

# Warmup the model
for _ in range(8):
    model(sample)

# Profile and display results
with profiler.profile(record_shapes=True) as prof:
    for _ in range(1000):
        model(sample)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

An example output with an image of …
@jluntamazon thanks, I understand now! /rebase
@jluntamazon ok I've gone through and smoothed out the PR a bit, hopefully without modifying the core functionality. If you can verify that my updates didn't break anything, I'm happy to merge on my side, assuming the CI checks pass.
From inspection it all looks good, but I'll check out the updates and give confirmation later today.
Ran some tests and it performs the same and produces the expected results. Also looking into some further performance optimizations for next steps, but with this change those improvements should be on our end.
@jluntamazon PR is merged. Thank you for your contributions!
@jluntamazon could you take a look at a new PR #2982 that wants to modify your out-of-place Detect() code here, please? The change would be applied to yolo.py L60 and add a … Thanks!
* Added flag to enable/disable all inplace and assignment operations
* Removed shape print statements
* Scope Detect/Model import to avoid circular dependency
* PEP8
* create _descale_pred()
* replace lost space
* replace list with tuple

Co-authored-by: Glenn Jocher <[email protected]>
This addresses issues with compiling for AWS Neuron by allowing users to remove slice assignment operators (#2643, aws-neuron/aws-neuron-sdk#253). There is an existing work-around that allows part of the model to compile to Neuron, but this change allows the entire model to be compiled in the upcoming Neuron SDK release. This should provide better performance and a more seamless user experience when using Neuron.
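As a rough illustration of what removing slice assignment means in practice, here is a minimal sketch (not the exact diff) of the two code paths such an `inplace` flag selects between. Tensor names and shapes are assumed for the example and are expected to broadcast against `y`:

```python
import torch

def postprocess(y, grid, stride, anchor_grid, inplace=True):
    # In-place slice assignment is what the Neuron compiler cannot handle;
    # the out-of-place branch rebuilds the tensor with torch.cat instead.
    if inplace:
        y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + grid) * stride  # xy
        y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid      # wh
    else:
        xy = (y[..., 0:2] * 2. - 0.5 + grid) * stride           # xy
        wh = (y[..., 2:4] * 2) ** 2 * anchor_grid               # wh
        y = torch.cat((xy, wh, y[..., 4:]), -1)                 # rebuild without mutation
    return y
```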
Code Changes:
This adds an `inplace` flag to the `Model` and `Detect` layers of the model, since these are the only internal modules that use in-place assignment. By default the `inplace` flag is `True`, which means that behavior is unchanged. The flag can now be toggled either by passing it to `attempt_load` or as a top-level configuration in the `cfg` YAML.

Potential Improvements:
- The `inplace` flag is not currently exposed in `detect.py`, but I could add that if it would be useful.
- To use the `Detect`/`Model` objects in the `attempt_load` function, I scope the import to avoid a circular dependency. I think ideally `attempt_load` should be moved out of `experimental.py`, but this could potentially break workflows.

🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Enhanced compatibility and configurability of YOLOv5 model operations.
📊 Key Changes
- Added an `inplace` parameter to control whether operations modify tensors in-place.
- Updated the `Detect` class and model-loading function to accommodate the new `inplace` argument.
- Refactored the `forward_augment` and `_descale_pred` methods for clarity.

🎯 Purpose & Impact