[TOPI][CUDA] Add faster-rcnn proposal op #2420
Conversation
Thank you for working on this! Currently we are using hybrid script developed by @were to replace ir_builder, which makes the code much more readable and easier to debug. It would be nice if you could use hybrid script instead of ir_builder. You can take a look at the generic SSD operators PR to see how hybrid script works: #2353
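A minimal sketch of the hybrid script style being suggested (the op and all identifiers are illustrative, not taken from #2353; this assumes the tvm.hybrid API of this era):

import tvm
from tvm import hybrid

@hybrid.script
def scale_boxes(boxes, scale):
    # output_tensor is a hybrid script intrinsic for declaring the result;
    # assumes float32 boxes
    out = output_tensor(boxes.shape, "float32")
    for i in range(boxes.shape[0]):
        for j in range(boxes.shape[1]):
            out[i, j] = boxes[i, j] * scale
    return out

# Calling the function with tvm placeholders builds the op symbolically:
# boxes = tvm.placeholder((1000, 4), "float32")
# res = scale_boxes(boxes, tvm.const(0.5, "float32"))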
cef8b92 to fe9830c
There are still some blockers to moving to hybrid.
cc @were
topi/python/topi/cuda/vision.py
Outdated
Parameters
----------
outs: Array of Tensor
    The computation graph description of roi_align
roi_align -> proposal op
topi/python/topi/generic/vision.py
Outdated
Parameters
----------
outs: Array of Tensor
    The computation graph description of roi_align
roi_align -> proposal op
@vinx13 is this implementation supposed to mirror the mxnet one?
for i in const_range(a_compilation_time_const):
    # do something

Then the loop will be unrolled at compile time.
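For reference, a self-contained sketch of const_range in hybrid script (names are illustrative; this assumes the tvm.hybrid API of this era). Because the trip count is a compile-time constant, the frontend can expand the loop during parsing:

import tvm
from tvm import hybrid

@hybrid.script
def sum_rows(data):
    # assumes data has compile-time shape (4, n); 4 matches const_range below
    out = output_tensor((4,), "float32")
    for i in const_range(4):            # constant trip count, expanded at compile time
        out[i] = 0.0
        for j in range(data.shape[1]):  # runtime trip count stays a loop
            out[i] = out[i] + data[i, j]
    return out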
offset = start + 2 * tid + (k % 2)
with ib.if_scope(
        tvm.all(offset + 1 < num_bbox, p_data[offset] < p_data[offset + 1])):
    temp_data[0] = p_data[offset]
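For context, a plain-Python sketch of the odd-even transposition sort that the snippet above parallelizes, one compare-and-swap pair per GPU thread (descending order, matching the comparison p_data[offset] < p_data[offset + 1]; purely illustrative, not code from this PR):

def odd_even_sort_desc(vals):
    n = len(vals)
    for k in range(n):                         # n passes guarantee a full sort
        start = k % 2                          # alternate even/odd pairings
        for offset in range(start, n - 1, 2):  # each pair maps to one thread above
            if vals[offset] < vals[offset + 1]:
                vals[offset], vals[offset + 1] = vals[offset + 1], vals[offset]
    return vals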
Because different offsets are executed in parallel, if two different offsets both satisfy the condition, they'll compete for temp_data[0] (the same thing happens for temp_index[0]). In that case, the argsort result is wrong.
temp_data is locally scoped, so each thread will have its own temp_data.
Have you tested your argsort on a slightly larger dataset, such as data_buf with shape (1, 500)? I tested it locally and it failed. The testing script can be found here: https://gist.github.com/Laurawly/66e8105c8db300bbce0771c1e58853ad
This is a potential bug in tvm: temp_index and temp_data are in global memory, though I declared them to be local. In fact, if the allocate statement is the first one emitted by the ir builder, the memory goes to global, because the allocate is outside the produce of the extern op. The IR is like
// attr .. storage_scope = 'local'
allocate ...
produce extern {
...
}
cc @tqchen
hmm, it would be great if you could look a bit into it
@Laurawly tvm can correctly allocate local memory as long as you put the allocate statement inside the thread scope (the scope_attr statement in ir builder). Otherwise the storage rewrite pass cannot find the attach point of the allocation and puts it at the beginning (where it will be allocated as global, but we may add an assertion). I think the current argsort works after we fix the global barrier #2473
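A minimal ir_builder sketch of the ordering described above (identifiers are illustrative, not from this PR): emit the thread binding via scope_attr first, then allocate, so the storage rewrite pass finds an attach point inside the thread scope and the buffer stays thread-local.

import tvm

def local_temp_ir(data, out):
    ib = tvm.ir_builder.create()
    p_data = ib.buffer_ptr(data)
    p_out = ib.buffer_ptr(out)
    tx = tvm.thread_axis("threadIdx.x")
    ib.scope_attr(tx, "thread_extent", 64)  # bind the thread scope first
    # allocated after the binding, so it attaches inside the thread scope
    temp = ib.allocate("float32", (1,), name="temp", scope="local")
    temp[0] = p_data[tx]                    # each thread gets its own temp
    p_out[tx] = temp[0]
    return ib.get()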
@Laurawly Seems I somehow dropped some commits here. After adding the global barrier, I can get the right result, but sometimes there is a deadlock, possibly because of a bug in the global barrier.
@vinx13 Yeah, it works for me for up to (1, 6000), but that's good enough for the test cases.
Yes, this is a mirror of the mxnet implementation.
nthread_tx = max_threads
nthread_bx = num_bbox // max_threads + 1
ib.scope_attr(tx, "thread_extent", nthread_tx)
ib.scope_attr(bx, "thread_extent", nthread_bx)
@Laurawly In nms_ir, sometimes valid bboxes are dropped due to conflicts. The only thing I did to fix this issue was binding bx to virtual threads. Could you take a look here?
@vinx13 Your nms_ir looks very similar to mine. I haven't faced test cases which drop bboxes. But one solution I have in mind: instead of parallelizing on the i axis, shall we parallelize on l and put i in a for loop for serialized writing to p_out?
@Laurawly I don't know how l can be parallelized, since a sequential dependency on l is required.
@vinx13 I recently tested nms on a Mali GPU and the data race occurred. When I changed blockIdx.x to vthread and added synchronization, it worked. I was also able to use blockIdx.x if I initialized p_out to -1, but I still needed the synchronization to make it work.
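A hedged sketch of that vthread workaround (the extents and the copy body are made up for illustration): binding what was blockIdx.x as a virtual thread lets the lowering pass serialize the virtual copies onto physical threads, avoiding the conflicting writes.

import tvm

def copy_ir(data, out, num_bbox):
    ib = tvm.ir_builder.create()
    p_data = ib.buffer_ptr(data)
    p_out = ib.buffer_ptr(out)
    max_threads = 64
    tx = tvm.thread_axis("threadIdx.x")
    bx = tvm.thread_axis("vthread")                # was "blockIdx.x"
    ib.scope_attr(tx, "thread_extent", max_threads)
    ib.scope_attr(bx, "virtual_thread", num_bbox // max_threads + 1)
    tid = bx * max_threads + tx
    with ib.if_scope(tid < num_bbox):
        p_out[tid] = p_data[tid]
    return ib.get()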
@masahi please follow up to moderate this PR as per https://docs.tvm.ai/contribute/committer_guide.html :)
@vinx13 is this PR still WIP?
@masahi yes, I'm trying to solve the data race. Sorry, I was on vacation; I will pick this up soon.
@Laurawly I have double-checked and found that the dropped boxes are due to floating point precision loss. When the number of boxes increases, there is a slight chance that the iou is very close to the threshold, so tvm and the reference implementation in mxnet produce different results.
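A toy demonstration of that failure mode (the numbers are made up, not from the PR): the same quantity computed with a different association of floating point operations can land on either side of a threshold, so two correct NMS implementations can keep or drop different boxes.

threshold = 0.6
iou_ref = (0.1 + 0.2) + 0.3   # 0.6000000000000001 in float64
iou_tvm = 0.1 + (0.2 + 0.3)   # 0.6
print(iou_ref > threshold, iou_tvm > threshold)  # True False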
thanks @vinx13 @Laurawly @kevinthesun, this is merged.
* [TOPI][CUDA] Add faster-rcnn proposal op
* Fix doc
* Add global barrier
* Use vthread in argsort
* Update sort and nms ir
* Fix lint
* Update sort ir in ssd nms
Please review @masahi @FrozenGene @tqchen @kevinthesun