
YOLOv5 AWS Inferentia Inplace compatibility updates #2953

Merged: 10 commits on Apr 30, 2021

Conversation

@jluntamazon (Contributor) commented Apr 27, 2021

This addresses issues with compiling for AWS Neuron by allowing users to avoid slice-assignment operators (#2643, aws-neuron/aws-neuron-sdk#253). There is an existing workaround that allows part of the model to compile to Neuron, but this change allows the entire model to be compiled in the upcoming Neuron SDK release. This should provide better performance and a more seamless user experience when using Neuron.

Code Changes:

This adds an inplace flag to the Model and Detect layers, since these are the only internal modules that use in-place assignment. By default the inplace flag is True, which means existing behavior is unchanged.

The flag can now be toggled either by passing it to attempt_load or by setting it as a top-level key in the cfg YAML (a minimal usage sketch follows).

Potential Improvements:

  • I did not expose this flag in any of the scripts such as detect.py, but I could if that would be useful.
  • I did not add unit tests to ensure that the layers functioned identically. If there's a good place to add unit tests, let me know.
  • Since I check for the Detect/Model objects in the attempt_load function, I scope the import to avoid a circular dependency (a rough sketch of this is shown after the list). I think ideally attempt_load should be moved out of experimental.py, but this could potentially break existing workflows.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced compatibility and configurability of YOLOv5 model operations.

📊 Key Changes

  • Added inplace parameter to control whether operations modify tensors in-place.
  • Adjusted the Detect class and model-loading function to accommodate the new inplace argument.
  • Improved the forward pass to be compatible with AWS Inferentia processors (a sketch of the pattern follows this list).
  • Refactored augmentation code into forward_augment and _descale_pred methods for clarity.

🎯 Purpose & Impact

  • 🛠 Provides flexibility in modifying tensors, allowing compatibility with platforms like AWS Inferentia that do not support in-place operations.
  • 🧠 Makes the codebase more modular and easier to maintain by cleanly separating augmentation logic.
  • 🚀 Potential impact includes more efficient deployment on diverse platforms, including cloud inference acceleration services, without sacrificing performance.

@github-actions bot left a comment

👋 Hello @jluntamazon, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master an automatic GitHub actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
  • ✅ Verify all Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@glenn-jocher (Member) commented Apr 27, 2021

@jluntamazon thanks for the PR!

There seems to be a recent change in GitHub Actions that is preventing automatic tests on each new commit for first-time contributors, but you can effectively run the same suite of tests locally with the commands below (exit code zero passes). We don't currently have tests that compare training/inference output values, but I'll review this a little further myself to verify.

rm -rf runs  # remove runs/
for m in yolov5s; do  # models
  python train.py --weights $m.pt --epochs 3 --img 320 --device 0  # train pretrained
  python train.py --weights '' --cfg $m.yaml --epochs 3 --img 320 --device 0  # train scratch
  for d in 0 cpu; do  # devices
    python detect.py --weights $m.pt --device $d  # detect official
    python detect.py --weights runs/train/exp/weights/best.pt --device $d  # detect custom
    python test.py --weights $m.pt --device $d  # test official
    python test.py --weights runs/train/exp/weights/best.pt --device $d  # test custom
  done
  python hubconf.py  # hub
  python models/yolo.py --cfg $m.yaml  # inspect
  python models/export.py --weights $m.pt --img 640 --batch 1  # export
done

@glenn-jocher (Member) commented Apr 28, 2021

@jluntamazon everything seems ok at first glance. I had an idea: would it be possible to clone the input when inplace=False rather than introduce separate routing for the input? Here's an example; I haven't tried this yet, so I'm not sure whether it works or whether it profiles slower or faster than the current concatenation ops. On the plus side, this may be faster and it reduces the codebase by eliminating the separate handling of the two cases.

class Module(nn.Module):
    def __init__(self, inplace=True):
        super().__init__()
        self.inplace = inplace

    def forward(self, x):
        if not self.inplace:
            x = x.clone()
        x *= 2  # common code regardless of inplace
        return x

EDIT: @jluntamazon I just realized I can't test this myself, since the problem we're trying to patch is on your side. Do you think you could try the above .clone() solution on your side to see if it addresses the issue? If it works, let me know; then the next step is to profile both implementations and merge the faster one if there is a difference, or the simpler one if they are comparable. Thanks!

@glenn-jocher (Member)

@jluntamazon ok, I've got 3 profiling results here on CPU (can't profile GPU unfortunately). The proposed PR is about 2.5x slower than master in the modified code region starting at yolo.py L56, and the proposed .clone() solution is about 2.3x slower. The profiling command is:

python test.py

master: 162 ms


PR: 400 ms


.clone(): 366 ms


@jluntamazon (Contributor, Author) commented Apr 28, 2021

Thanks for the quick feedback!

would it be possible to clone the input on inplace=False rather than introduce separate routing for the input?

The key issue is that the Neuron compilation process currently doesn't support in-place assignment, so unfortunately a clone does not solve the issue.

To give some additional background: on Neuron we compile the graph into an optimized format that is distinct from the original torch operations/graph. We trace the model and then send that graph to our optimizing compiler. What we get in the end is a one-operation torch graph (unless we have unsupported operators) where the fused computation is performed completely on chip. This means we cannot properly compare on a per-op basis, due to compiler optimization and the fact that torch operators are no longer run as-is.

If it works let me know, then the next step is to profile both implementations and merge the faster one if there is a difference, or the simpler one if they are comparable

For the above tests, I'm assuming you were using yolov5s? I could provide a profile with the compiled Neuron version to give a better idea of how it performs.

@jluntamazon (Contributor, Author)

Here are initial performance results using the concatenation method:

Model Variant    Image Size    Time (ms)
yolov5s          480x480       17.931
yolov5s          384x640       19.611
yolov5s          640x480       24.135
yolov5s          640x640       34.050

These numbers were collected using the torch profiler:

import torch
import torch_neuron
import torch.autograd.profiler as profiler

# Load the compiled model
model = torch.jit.load('model.pth')

# Create a sample image
sample = torch.rand([1, 3, 480, 480])

# Warmup the model
for _ in range(8):
	model(sample)

# Profile and display results
with profiler.profile(record_shapes=True) as prof:
	for _ in range(1000):
		model(sample)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

An example output with an image of 480x480:

---------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------  ------------  ------------  ------------  ------------  ------------  ------------  
              forward         0.08%      14.675ms       100.00%       17.931s      17.931ms          1000  
    neuron::forward_4        99.85%       17.904s        99.92%       17.917s      17.917ms          1000  
          aten::empty         0.04%       7.614ms         0.04%       7.614ms       1.903us          4000  
           aten::set_         0.03%       5.024ms         0.03%       5.024ms       1.256us          4000  
---------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 17.931s

@glenn-jocher (Member)

@jluntamazon thanks, I understand now!

/rebase

@glenn-jocher (Member)

@jluntamazon ok, I've gone through and smoothed out the PR a bit, hopefully without modifying the core functionality. If you can verify that my updates didn't break anything, I'm happy to merge on my side, assuming the CI checks pass.

@jluntamazon (Contributor, Author)

From inspection it all looks good, but I'll check out the updates and confirm later today.

@jluntamazon (Contributor, Author)

I ran some tests and it performs the same and produces the expected results. I'm also looking into some further performance optimizations as next steps, but with this change, those improvements should be on our end.

@glenn-jocher glenn-jocher changed the title Added flag to enable/disable inplace operations YOLOv5 AWS Inferentia Inplace compatibility updates Apr 30, 2021
@glenn-jocher glenn-jocher added the enhancement New feature or request label Apr 30, 2021
@glenn-jocher (Member)

@jluntamazon PR is merged. Thank you for your contributions!

@glenn-jocher (Member) commented May 3, 2021

@jluntamazon could you take a look at a new PR #2982 that wants to modify your out-of-place Detect() code here please?

The change would be applied at yolo.py L60 and adds a .view() call. The .view() operation does NOT change the shape of wh; the reason for it is that ONNX export requires an explicit view shape, which it otherwise seems to lack on export. The line
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
would become:
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(1, self.na, 1, 1, 2)  # wh

Thanks!

KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
* Added flag to enable/disable all inplace and assignment operations

* Removed shape print statements

* Scope Detect/Model import to avoid circular dependency

* PEP8

* create _descale_pred()

* replace lost space

* replace list with tuple

Co-authored-by: Glenn Jocher <[email protected]>
danny-schwartz7 (danny-schwartz7/yolov5, May 22, 2021), Lechtr (Lechtr/yolov5, Jul 20, 2021, cherry picked from commit 41f5cc5), and BjarneKuehl (fhkiel-mlaip/yolov5, Aug 26, 2022) also pushed commits referencing this pull request, with the same commit messages as above.