Better grouped convolution for CPU targets #6137

Merged: 28 commits merged into apache:main on Mar 25, 2021

Conversation

@Wheest (Contributor) commented Jul 26, 2020

This pull request replaces the current direct grouped convolution algorithm on x86 and Arm targets with the faster Grouped Spatial Pack Convolutions (GSPC) algorithm.

Here's a performance comparison graph for ResNet34 on a single big core of a Hikey 970 as we increase the number of groups:

Note that in the untuned case the current depthwise convolution outperforms GSPC, thus I have omitted it from the pull request.

This is my first proper pull request to TVM, so there may be some issues I haven't spotted, or style problems.

In short, this commit adds identical GSPC compute definitions and schedules for grouped convolutions on the x86 and arm_cpu targets, and updates the Relay operator strategy for each.
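For context, here is a minimal sketch of the path this change plugs into: a grouped conv2d expressed in Relay and built for a CPU target, where the operator strategy can pick up the new GSPC compute and schedule. The shapes, group count, and target below are illustrative, not taken from the PR.

import tvm
from tvm import relay

# Grouped conv2d: 64 input channels, 64 output channels, 8 groups.
data = relay.var("data", shape=(1, 64, 56, 56), dtype="float32")
weight = relay.var("weight", shape=(64, 8, 3, 3), dtype="float32")  # (O, I // groups, KH, KW)
out = relay.nn.conv2d(
    data, weight, kernel_size=(3, 3), padding=(1, 1), groups=8,
    data_layout="NCHW", kernel_layout="OIHW",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Building for an x86 (llvm) target goes through the x86 op strategy,
# which is where this PR registers the GSPC implementation.
lib = relay.build(mod, target="llvm")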

@tqchen (Member) commented Jul 31, 2020

cc @anijain2305 @FrozenGene @mbaret @giuseros it would be great if you could help review the PR

@tqchen (Member) commented Jul 31, 2020

@Wheest please help to fix the CI lint error

@tqchen added the "status: need update" label on Jul 31, 2020
@FrozenGene (Member) left a comment

The requested change is that we should support asymmetric padding, like the other compute/schedule implementations.

Resolved review threads on topi/python/topi/arm_cpu/group_conv2d.py (outdated).
# pack kernel
shape = (G, KPG // oc_bn, CPG // ic_bn,
         KH, KW, ic_bn, oc_bn)
kernel_vec = te.compute(shape,
Member:

I think we could do this in alter_op_layout for the kernel; then we won't need to schedule kernel_vec, since it will already be a plain tensor when we do inference. You could refer to the spatial pack of Arm conv2d.
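As a rough sketch of this suggestion, modelled on the existing x86 NCHWc alter-op hook: the kernel is re-declared in a packed layout inside the alter-op hook, so the layout_transform on the weight is constant-folded at compile time and no kernel_vec stage is needed in the schedule. The function name, blocking factors, and layout strings below are illustrative assumptions, not the PR's actual code; the open problem discussed later in this thread is that a GSPC-style layout with a group axis is not yet known to the C++ side.

from tvm import relay

def _alter_group_conv2d_layout(attrs, inputs, tinfos, out_type):
    """Hypothetical sketch; real logic would live in the target's registered
    _alter_conv2d_layout hook."""
    ic_bn, oc_bn = 8, 8  # illustrative blocking factors
    new_attrs = {k: attrs[k] for k in attrs.keys()}
    new_attrs["data_layout"] = "NCHW%dc" % ic_bn
    new_attrs["kernel_layout"] = "OIHW%di%do" % (ic_bn, oc_bn)
    new_attrs["out_layout"] = "NCHW%dc" % oc_bn
    # Relay inserts layout_transform ops on the inputs; the weight transform is
    # folded ahead of time, so the compute receives a pre-packed kernel tensor.
    return relay.nn.contrib_conv2d_nchwc(*inputs, **new_attrs)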

Contributor Author:

Regarding the issue I was having, I think I misunderstood and was trying to alter the shape of both the data and the kernel ahead of time. I am making the fix now.

Contributor Author:

Hi @FrozenGene, my final blocker to completing this PR is adding the GSPC kernel layout to the C++ runtime. I've got the Python side working; however, alter_op requires the layout to be available in the C++ runtime, and I'm unsure how to do this.

See this post on the forums where I explain my issue. Would you be able to give any pointers, please?

Member:

@merrymercy @minminsun does this issue have the same cause as our auto-scheduler layout rewrite issue? Do you have any ideas about it?

Contributor Author:

Hey @FrozenGene @merrymercy @minminsun, any thoughts on adding custom kernel layouts to the C++ runtime, so that alter_op can have kernels reshaped ahead of time?

Contributor Author:

If this issue cannot be easily resolved, could the pull request be merged without alter_op_layout, with a new PR made to track the issue and include my attempts so far at solving it?

Member:

Yes. Sorry for delaying this so long. We could open an issue to track it.

@Wheest (Contributor Author) commented Mar 18, 2021

Thanks. I believe I have responded to the issues, and will see whether the latest commit passes the PR checks. For the nn.utils.infer_pad issue, it's not being used by any code in TVM, but I will be starting a discussion on the forums about it, as it could be useful in future.

# schedule data
if DOPAD:
    s[A0].compute_inline()
groups, batch, ic_chunk, ih, ic_block, iw = s[A1].op.axis
Member:

Maybe we should have a compute_at for A1 at CC, not just parallel?


# schedule data
if DOPAD:
    s[A0].compute_inline()
Member:

Consider using compute_at and vectorizing the data load. According to the experiment in #6095 (comment), this should give better performance than compute_inline directly.
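To illustrate the suggestion, here is a tiny self-contained TE example, a toy pad-and-copy pipeline with made-up shapes rather than the PR's actual stages: the padded data is attached under an outer loop of its consumer with compute_at and its innermost axis is vectorized, instead of being inlined.

import tvm
from tvm import te, topi

A = te.placeholder((1, 32, 56, 56), name="A")
Apad = topi.nn.pad(A, [0, 0, 1, 1], [0, 0, 1, 1], name="Apad")
B = te.compute(Apad.shape, lambda n, c, h, w: Apad[n, c, h, w] * 2.0, name="B")

s = te.create_schedule(B.op)
n, c, h, w = s[B].op.axis
s[Apad].compute_at(s[B], h)   # materialize the padded rows under B's h loop
_, _, _, wi = s[Apad].op.axis
s[Apad].vectorize(wi)         # vectorized data load along the innermost axis
print(tvm.lower(s, [A, B], simple_mode=True))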

@ZihengJiang (Contributor) commented:

Ping for update @FrozenGene @Wheest

@Wheest (Contributor Author) commented Sep 16, 2020

@ZihengJiang I have fixed the linting errors.

I have added asymmetric padding support; however, this isn't consistent across the rest of TOPI. I have added a new workload type in python/tvm/topi/nn/conv2d.py that could be adopted across other conv2d implementations (see the sketch after this comment).

I have been looking at the suggested schedule improvements, but have not been able to get any speedups so far.

I have not yet figured out how to do the kernel packing step in alter_op_layout, but I think I can do it when I have some time next week.
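For illustration only, here is a sketch of what a workload record carrying asymmetric padding could look like. The field names are assumptions, not necessarily those added in python/tvm/topi/nn/conv2d.py.

from collections import namedtuple

# Hypothetical workload with four padding fields (top/left/bottom/right)
# instead of the older symmetric hpad/wpad pair.
Workload = namedtuple("Workload", [
    "in_dtype", "out_dtype", "height", "width",
    "in_filter", "groups", "out_filter",
    "kernel_h", "kernel_w",
    "padt", "padl", "padb", "padr",
    "stride_h", "stride_w",
])

wkl = Workload("float32", "float32", 56, 56, 64, 8, 64, 3, 3, 1, 1, 1, 1, 1, 1)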

@ZihengJiang (Contributor) commented:

Sounds good! Just ping the reviewers when you feel it is ready to be reviewed again. Thanks for the work!

@tqchen changed the base branch from master to main on October 11, 2020 at 18:26
@Wheest (Contributor Author) commented Dec 21, 2020

Hello there, I'm updating this pull request to bring it up to date with the latest main branch.

In terms of things remaining to do:

In the interests of more transparent development, here's part of my test suite.

@FrozenGene (Member) commented:

> Hello there, I'm updating this pull request to bring it up to date with the latest main branch.
> In terms of things remaining to do:
> In the interests of more transparent development, here's part of my test suite.

Sorry, @Wheest, I just noticed your mention. Could you give me some time to review your code? I can allocate some next week.

@Wheest (Contributor Author) commented Dec 26, 2020

Sure thing, thanks for reviewing! You can see my ongoing PR for the asymmetric padding here.

@merrymercy (Member) commented Dec 26, 2020

We can also try the auto-scheduler; it should deliver very good performance for group convolution. The benefit is that we don't need to worry about schedule templates anymore. The non-trivial part of the implementation is enabling layout rewrite for group conv2d, which is critical for performance. If you are willing to try it, I can give more instructions.

@Wheest (Contributor Author) commented Dec 26, 2020

Hi @merrymercy, I'm just reading about the auto-scheduler now; is the RFC the best learning resource? How does it compare to Ansor?

I'm willing to give it a try, though perhaps it would be better to get this PR merged first (since it provides an immediate and clear performance improvement over the existing solution) and add improvements in another one?

@merrymercy (Member) commented Dec 26, 2020

The RFC you mentioned is very old and that project is deprecated. We started another project, "Ansor", to bring an auto-scheduler to TVM. Ansor is integrated as the tvm.auto_scheduler package in the current code base. "Ansor" and "auto-scheduler" in the current code base are the same thing, just different names. You can see the RFC for the Ansor auto-scheduler here, but that RFC is also slightly old. The best way to learn it is these tutorials.

I totally agree with you that we can merge this PR first.

@FrozenGene (Member) commented Dec 29, 2020

> Pack in alter_op_layout for kernel: have been working on this, but have an issue. My data is being passed to my group_conv2d_NCHWc.x86 in the conv2d_NCHWc format (5D input data), rather than the GSPC format (6D input data), despite my changes to the x86 _alter_conv2d_layout. See this branch. Some guidance or pointers would be appreciated @FrozenGene.

@Wheest I don't quite understand your problem. Inside this pass we should only handle the kernel layout; it is not related to the data layout. Could you explain it a bit more clearly?
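For reference, these are the two packed data layouts being contrasted in the quoted problem, with illustrative sizes (N=1, C=64, H=W=56, groups G=4, blocking factor ic_bn=8). The 6D axis order follows the schedule snippet earlier in the thread (groups, batch, ic_chunk, ih, ic_block, iw).

# conv2d_NCHWc packed data: 5D, no group axis.
nchwc_shape = (1, 64 // 8, 56, 56, 8)      # (N, C // ic_bn, H, W, ic_bn)

# GSPC packed data: 6D, with an explicit group axis and channels per group.
cpg = 64 // 4                              # channels per group
gspc_shape = (4, 1, cpg // 8, 56, 8, 56)   # (G, N, CPG // ic_bn, IH, ic_bn, IW)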

Comment on lines 178 to 214
if not is_auto_scheduler_enabled():
    logger.warning("group_conv2d is not optimized for x86 with autotvm.")
strategy.add_implementation(
    wrap_compute_conv2d(topi.nn.group_conv2d_nchw, has_groups=True),
    wrap_topi_schedule(topi.generic.schedule_group_conv2d_nchw),
    name="group_conv2d_nchw.generic",
)
@FrozenGene (Member) commented Dec 29, 2020

We should compare your implementation's performance with the auto-scheduler. We shouldn't delete the auto-scheduler support outright.

Contributor Author:

What kind of strategy logic would include both? Having this code on an else branch of if not is_auto_scheduler_enabled():? Or something else?

Member:

if not is_auto_scheduler_enabled(): ... We prefer the auto-scheduler to be the default strategy.

Contributor Author:

The current code should now be able to run with the auto-scheduler when "relay.backend.use_auto_scheduler": True is set, and otherwise uses my schedule.

Resolved review threads on python/tvm/relay/op/strategy/x86.py (outdated).

# no stride and padding info here
padding = infer_pad(data, data_pad)
hpad, wpad = padding
Member:

Should this be pad_t, pad_l, pad_b, pad_r?

Contributor Author:

We only get hpad and wpad from nn.utils.infer_pad, since this information can't be recovered from data and data_pad alone. This has no effect on the schedule, since we only use it to derive the DOPAD variable.
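As a small self-contained illustration of what the snippet above does (shapes are made up): infer_pad only compares the padded and unpadded shapes, and the result merely gates whether a separate padding stage exists in the schedule.

import tvm
from tvm import te, topi
from tvm.topi.nn.utils import infer_pad

data = te.placeholder((1, 32, 56, 56), name="data")
data_pad = topi.nn.pad(data, [0, 0, 1, 1], [0, 0, 1, 1], name="data_pad")

hpad, wpad = infer_pad(data, data_pad)  # -> (1, 1): symmetric pad per spatial dim
DOPAD = hpad != 0 or wpad != 0          # only used to decide whether a pad stage exists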

Member:

I think we should fix the issue with nn.utils.infer_pad, since we now support asymmetric padding. The fact that nn.utils.infer_pad only returns two values is for historical reasons. However, as it doesn't affect performance or correctness, it is up to you whether to support it in this PR.

@Wheest (Contributor Author) commented Mar 18, 2021

Thanks, I think I will sort out nn.utils.infer_pad in a separate PR this weekend, since there are a couple of design questions around it that need to be answered.

The conv2d codebase has changed a lot since I last looked at it. I did some digging and found that the non-grouped version of spatial pack convolution has stopped using nn.utils.infer_pad, so I have updated the GSPC code to follow the same convention.
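For illustration, this is my understanding of the convention the non-grouped spatial pack now follows (not code from this PR): the four asymmetric pads are derived from the op's padding attribute and kernel size with get_pad_tuple, rather than re-inferred from the data/data_pad shapes.

from tvm.topi.nn.utils import get_pad_tuple

# A padding attribute of (1, 1) and a 3x3 kernel, purely illustrative.
pad_top, pad_left, pad_down, pad_right = get_pad_tuple((1, 1), (3, 3))
DOPAD = any(p != 0 for p in (pad_top, pad_left, pad_down, pad_right))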

Comment on lines 208 to 221
     if not is_auto_scheduler_enabled():
         logger.warning("group_conv2d is not optimized for x86 with autotvm.")
     strategy.add_implementation(
-        wrap_compute_conv2d(topi.nn.group_conv2d_nchw, has_groups=True),
-        wrap_topi_schedule(topi.generic.schedule_group_conv2d_nchw),
-        name="group_conv2d_nchw.generic",
+        wrap_compute_conv2d(topi.x86.group_conv2d_nchw, has_groups=True),
+        wrap_topi_schedule(topi.x86.schedule_group_conv2d_nchw),
+        name="group_conv2d_nchw.x86",
     )
 elif layout == "NHWC":
     assert kernel_layout == "HWIO"
     if not is_auto_scheduler_enabled():
         logger.warning("group_conv2d is not optimized for x86 with autotvm.")
-    strategy.add_implementation(
-        wrap_compute_conv2d(topi.nn.group_conv2d_nhwc, has_groups=True),
-        wrap_topi_schedule(topi.generic.schedule_group_conv2d_nhwc),
-        name="group_conv2d_nhwc.generic",
-    )
+        strategy.add_implementation(
+            wrap_compute_conv2d(topi.nn.group_conv2d_nhwc, has_groups=True),
+            wrap_topi_schedule(topi.generic.schedule_group_conv2d_nhwc),
+            name="group_conv2d_nhwc.generic",
+        )
Member:

The fix has a bug when we enable the auto-scheduler: with the auto-scheduler enabled, there is no compute declaration left registered for it. Please fix it.

Contributor Author:

The bug has been fixed by removing the indentation.

@FrozenGene (Member) commented:

@Wheest It seems you have a wrong rebase. Please refer to this documentation: https://tvm.apache.org/docs/contribute/git_howto.html

@Wheest (Contributor Author) commented Mar 24, 2021

> @Wheest It seems you have a wrong rebase. Please refer to this documentation: https://tvm.apache.org/docs/contribute/git_howto.html

I believe I have fixed the rebase/commit problems, and will try to avoid these in future for a cleaner PR.

@FrozenGene merged commit 7130e80 into apache:main on Mar 25, 2021
@FrozenGene (Member) commented:

Thanks @Wheest, merged now.

@Wheest (Contributor Author) commented Mar 25, 2021

Thanks for helping me through my first PR for a major OSS project, @FrozenGene. I have learned a lot, and I hope to be a more productive member of the community in future.

@tqchen (Member) commented Mar 25, 2021

Thank you @Wheest! And thank you @FrozenGene for shepherding.

@tqchen added the "status: accepted" label and removed the "status: need review" and "status: need update" labels on Mar 25, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
* integrated with v0.8

* Rebase, and undoing accidental removal of auto scheduler NHWC support

* Added ASF license header

* Minor bug fixes

* Added asymmetric padding support
Fixed linting

* Improve linting

* Better linting, disable final linting checks

* Fixed final linting errors (figured out how to run lint tests locally)

* fixing linter formatting part 1

* fixing linter formatting part 2

* fixing linter formatting part 3

* Update conv2d.py

Fixed merge issue

* Rebase, and update responding to some comments

* Fixed AutoScheduler bug for NHWC case

* removed infer_pad from GSPC

* Rebase, and undoing accidental removal of auto scheduler NHWC support

* Added ASF license header

* Minor bug fixes

* Added asymmetric padding support
Fixed linting

* Improve linting

* Better linting, disable final linting checks

* Fixed final linting errors (figured out how to run lint tests locally)

* Update conv2d.py

Fixed merge issue

* Rebase, and update responding to some comments

* Fixed AutoScheduler bug for NHWC case

* Minor fix

* Fixed removal of infer_pad to no padding

* Fixed unexpected linting error

Co-authored-by: Perry Gibson <[email protected]>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021, with the same commit message as above.