
[TOPI] Parallelize GPU NMS inner loop #7172

Merged
merged 7 commits into apache:main on Dec 30, 2020
Conversation

@masahi (Member) commented Dec 28, 2020

This is a follow-up to #7136. I found a simple way to parallelize the inner loop of GPU NMS, which has been done sequentially since #6839 and is hence extremely slow when the number of input boxes is large. This change brings a massive speedup on object detection models from PyTorch and Gluon, as shown below.

GPU NMS workload from PyTorch MaskRCNN (4500 boxes)
Before: 2.1 sec
After: 5.78 ms

Workload from Gluon SSD
Before: 12.5 ms
After: 0.206 ms

Before I explain what I did, here is how we currently do the sequential O(N**2) triangle loop. It is executed by a single thread.

# The outer loop goes through boxes sorted by descending score.
for j in range(nboxes):
    # The inner loop checks whether box j overlaps too much with any
    # preceding valid box k; if so, box j is invalidated.
    for k in range(j):
        if box k is valid:
            do IOU test between j and k and possibly invalidate j
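For reference, here is a minimal runnable NumPy version of that sequential scheme. This is a sketch for illustration only; the (x1, y1, x2, y2) box layout and the iou helper are assumptions, and the PR itself emits TVM IR in python/tvm/topi/cuda/nms.py.

import numpy as np

def iou(a, b):
    # IOU of one box `a` against an (m, 4) array of boxes `b`,
    # with boxes stored as (x1, y1, x2, y2).
    lt = np.maximum(a[:2], b[:, :2])      # top-left of intersection
    rb = np.minimum(a[2:], b[:, 2:])      # bottom-right of intersection
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def nms_sequential(boxes, iou_threshold):
    # Boxes are assumed pre-sorted by descending score.
    n = len(boxes)
    valid = np.ones(n, dtype=bool)
    for j in range(1, n):
        for k in range(j):
            if valid[k] and iou(boxes[k], boxes[j:j + 1])[0] > iou_threshold:
                valid[j] = False
                break
    return np.nonzero(valid)[0]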

My parallelization instead covers the same triangle in the following way:

for j in range(nboxes):
    if j is still valid:
        for k in range(j + 1, nboxes) in parallel:
            do IOU test between j and k and possibly invalidate box k
    # Flush the IOU results from the inner loop to global memory before
    # continuing to the next iteration, so that all threads can see them.
    syncthreads()

The idea is that at the start of the inner loop, box j is known to be a valid box, and the inner loop invalidates succeeding boxes that have high overlap with the newly found valid box j. The inner loop can be trivially done in parallel, and the number of IOU tests reduces to O(#selected boxes * N).
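In NumPy terms, the new scheme behaves like the following CPU simulation, reusing the hypothetical iou helper from the sketch above; again an illustration of the semantics, not the generated IR.

def nms_parallel_semantics(boxes, iou_threshold):
    n = len(boxes)
    valid = np.ones(n, dtype=bool)
    for j in range(n):
        if valid[j]:
            # Conceptually parallel: one independent IOU test per k > j,
            # each of which may invalidate its own box k.
            valid[j + 1:] &= iou(boxes[j], boxes[j + 1:]) <= iou_threshold
    return np.nonzero(valid)[0]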

Now the inner loop is done in parallel while the outer loop remains sequential. All threads need to execute the outer loop in lockstep: the result of checking whether box j is still valid must be consistent across all threads. Since we cannot do a global sync inside a kernel, I use only one thread block for parallelization and call __syncthreads() after each inner loop.
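Here is a minimal sketch of this single-block pattern, written with numba.cuda purely for illustration; the PR itself generates the equivalent TVM IR, and the iou_pair helper and launch configuration are assumptions.

from numba import cuda

@cuda.jit(device=True)
def iou_pair(boxes, j, k):
    # boxes is an (N, 4) device array of (x1, y1, x2, y2).
    xx1 = max(boxes[j, 0], boxes[k, 0])
    yy1 = max(boxes[j, 1], boxes[k, 1])
    xx2 = min(boxes[j, 2], boxes[k, 2])
    yy2 = min(boxes[j, 3], boxes[k, 3])
    inter = max(xx2 - xx1, 0.0) * max(yy2 - yy1, 0.0)
    area_j = (boxes[j, 2] - boxes[j, 0]) * (boxes[j, 3] - boxes[j, 1])
    area_k = (boxes[k, 2] - boxes[k, 0]) * (boxes[k, 3] - boxes[k, 1])
    return inter / (area_j + area_k - inter)

@cuda.jit
def nms_kernel(boxes, valid, iou_threshold):
    tx = cuda.threadIdx.x
    nthreads = cuda.blockDim.x
    nboxes = boxes.shape[0]
    for j in range(nboxes):
        # All threads read the same valid[j]; the barrier below
        # guarantees they see a consistent value.
        if valid[j]:
            # Block-strided parallel inner loop over successors of j.
            for k in range(j + 1 + tx, nboxes, nthreads):
                if valid[k] and iou_pair(boxes, j, k) > iou_threshold:
                    valid[k] = False
        # Flush the inner loop's writes so every thread sees them before
        # the next outer iteration; sufficient because there is one block.
        cuda.syncthreads()

# Launch with exactly one thread block, e.g.:
#   nms_kernel[1, 256](boxes_dev, valid_dev, 0.5)

Launching a single block sacrifices occupancy, but it is what allows __syncthreads() to act as a barrier over all participating threads in place of the global sync that kernels cannot do.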

I have one more PR coming to optimize GPU NMS IR further.

Please review @Laurawly @kevinthesun @zhiics @vinx13

@Laurawly (Contributor) left a comment

LGTM

masahi merged commit 66e123f into apache:main on Dec 30, 2020
@masahi (Member, Author) commented Dec 30, 2020

thanks @Laurawly

@mbrookhart (Contributor)

Kudos!

tkonolige pushed a commit to tkonolige/incubator-tvm that referenced this pull request Jan 11, 2021
* make NMS inner loop parallel

* use one block to avoid global sync issue

* temp disable write by only thread 0

* leave a TODO on write by only one thread

* add some comments, remove the check on negative class id

* minor improvement when topk is available

* fix write by a single thread

TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021

electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021