
[TOPI] Parallelize GPU NMS inner loop #7172

Merged
merged 7 commits into apache:main on Dec 30, 2020
Conversation

@masahi (Member) commented Dec 28, 2020

This is a follow-up to #7136. I found a simple way to parallelize the inner loop of GPU NMS, which has been done sequentially since #6839 and is hence extremely slow when the number of input boxes is large. This change brings a massive speedup on object detection models from PyTorch and Gluon, as shown below.

GPU NMS workload from PyTorch MaskRCNN (4500 boxes)
Before: 2.1 sec
After: 5.78 ms

Workload from Gluon SSD
Before: 12.5 ms
After: 0.206 ms

Before I explain what I did, here is how we currently do the sequential O(N**2) triangle loop. It is executed by a single thread.

# The outer loop goes through boxes sorted by descending score.
for j in range(nboxes):
    # The inner loop checks whether box j overlaps too much with any
    # preceding valid box k; if so, box j is invalidated.
    for k in range(j):
        if box k is valid:
            do IOU test between j and k and possibly invalidate j
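For reference, here is a minimal runnable NumPy version of that sequential scheme. This is a sketch for illustration only; the (x1, y1, x2, y2) box layout and the iou helper are assumptions, and the PR itself emits TVM IR in python/tvm/topi/cuda/nms.py.

import numpy as np

def iou(a, b):
    # IOU of one box `a` against an (m, 4) array of boxes `b`,
    # with boxes stored as (x1, y1, x2, y2).
    lt = np.maximum(a[:2], b[:, :2])      # top-left of intersection
    rb = np.minimum(a[2:], b[:, 2:])      # bottom-right of intersection
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def nms_sequential(boxes, iou_threshold):
    # Boxes are assumed pre-sorted by descending score.
    n = len(boxes)
    valid = np.ones(n, dtype=bool)
    for j in range(1, n):
        for k in range(j):
            if valid[k] and iou(boxes[k], boxes[j:j + 1])[0] > iou_threshold:
                valid[j] = False
                break
    return np.nonzero(valid)[0]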

My parallelization instead covers the same triangle in the following way:

for j in range(nboxes):
    if j is still valid:
        for k in range(j + 1, nboxes) in parallel:
            do IOU test between j and k and possibly invalidate box k
    # Flush the IOU results from the inner loop to global memory before
    # continuing to the next iteration, so that all threads can see them.
    syncthreads()

The idea is that at the start of the inner loop, box j is known to be a valid box, and the inner loop invalidates succeeding boxes that have high overlap with the newly found valid box j. The inner loop can be trivially done in parallel, and the number of IOU tests reduces to O(#selected boxes * N).
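In NumPy terms, the new scheme behaves like the following CPU simulation, reusing the hypothetical iou helper from the sketch above; again an illustration of the semantics, not the generated IR.

def nms_parallel_semantics(boxes, iou_threshold):
    n = len(boxes)
    valid = np.ones(n, dtype=bool)
    for j in range(n):
        if valid[j]:
            # Conceptually parallel: one independent IOU test per k > j,
            # each of which may invalidate its own box k.
            valid[j + 1:] &= iou(boxes[j], boxes[j + 1:]) <= iou_threshold
    return np.nonzero(valid)[0]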

Now the inner loop is done in parallel while the outer loop remains sequential. All threads need to execute the outer loop in lockstep: the result of checking whether box j is still valid must be consistent across all threads. Since we cannot do a global sync inside a kernel, I use only one thread block for parallelization and call __syncthreads() after each inner loop.
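Here is a minimal sketch of this single-block pattern, written with numba.cuda purely for illustration; the PR itself generates the equivalent TVM IR, and the iou_pair helper and launch configuration are assumptions.

from numba import cuda

@cuda.jit(device=True)
def iou_pair(boxes, j, k):
    # boxes is an (N, 4) device array of (x1, y1, x2, y2).
    xx1 = max(boxes[j, 0], boxes[k, 0])
    yy1 = max(boxes[j, 1], boxes[k, 1])
    xx2 = min(boxes[j, 2], boxes[k, 2])
    yy2 = min(boxes[j, 3], boxes[k, 3])
    inter = max(xx2 - xx1, 0.0) * max(yy2 - yy1, 0.0)
    area_j = (boxes[j, 2] - boxes[j, 0]) * (boxes[j, 3] - boxes[j, 1])
    area_k = (boxes[k, 2] - boxes[k, 0]) * (boxes[k, 3] - boxes[k, 1])
    return inter / (area_j + area_k - inter)

@cuda.jit
def nms_kernel(boxes, valid, iou_threshold):
    tx = cuda.threadIdx.x
    nthreads = cuda.blockDim.x
    nboxes = boxes.shape[0]
    for j in range(nboxes):
        # All threads read the same valid[j]; the barrier below
        # guarantees they see a consistent value.
        if valid[j]:
            # Block-strided parallel inner loop over successors of j.
            for k in range(j + 1 + tx, nboxes, nthreads):
                if valid[k] and iou_pair(boxes, j, k) > iou_threshold:
                    valid[k] = False
        # Flush the inner loop's writes so every thread sees them before
        # the next outer iteration; sufficient because there is one block.
        cuda.syncthreads()

# Launch with exactly one thread block, e.g.:
#   nms_kernel[1, 256](boxes_dev, valid_dev, 0.5)

Launching a single block sacrifices occupancy, but it is what allows __syncthreads() to act as a barrier over all participating threads in place of the global sync that kernels cannot do.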

I have one more PR coming to optimize GPU NMS IR further.

Please review @Laurawly @kevinthesun @zhiics @vinx13

@Laurawly (Contributor) left a comment

LGTM

masahi merged commit 66e123f into apache:main on Dec 30, 2020
@masahi (Member, Author) commented Dec 30, 2020

thanks @Laurawly

@mbrookhart (Contributor)

Kudos!

tkonolige pushed a commit to tkonolige/incubator-tvm that referenced this pull request Jan 11, 2021
* make NMS inner loop parallel

* use one block to avoid global sync issue

* temp disable write by only thread 0

* leave a TODO on write by only one thread

* add some comments, remove the check on negative class id

* minor improvement when topk is available

* fix write by a single thread

TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021

electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021