Better grouped convolution for CPU targets #6137

Merged: 28 commits merged into apache:main on Mar 25, 2021

Conversation

@Wheest (Contributor) commented Jul 26, 2020

This pull request replaces the current direct grouped convolution algorithm on x86 and Arm targets with the faster Grouped Spatial Pack Convolutions (GSPC) algorithm.

Here's a performance comparison graph for ResNet34 on a single big core of a Hikey 970 as we increase the number of groups:

Note that in the untuned case the current depthwise convolution outperforms GSPC, thus I have omitted it from the pull request.

This is my first proper pull request to TVM, so there may be some issues I haven't spotted, or style problems.

In short, this commit adds identical GSPC compute definitions and schedules for grouped convolutions on the x86 and arm_cpu targets, and updates the Relay operator strategy for each.
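For context, here is a minimal sketch of the path this change plugs into: a grouped conv2d expressed in Relay and built for a CPU target, where the operator strategy can pick up the new GSPC compute and schedule. The shapes, group count, and target below are illustrative, not taken from the PR.

import tvm
from tvm import relay

# Grouped conv2d: 64 input channels, 64 output channels, 8 groups.
data = relay.var("data", shape=(1, 64, 56, 56), dtype="float32")
weight = relay.var("weight", shape=(64, 8, 3, 3), dtype="float32")  # (O, I // groups, KH, KW)
out = relay.nn.conv2d(
    data, weight, kernel_size=(3, 3), padding=(1, 1), groups=8,
    data_layout="NCHW", kernel_layout="OIHW",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Building for an x86 (llvm) target goes through the x86 op strategy,
# which is where this PR registers the GSPC implementation.
lib = relay.build(mod, target="llvm")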

@tqchen (Member) commented Jul 31, 2020

cc @anijain2305 @FrozenGene @mbaret @giuseros it would be great if you could help review the PR

@tqchen (Member) commented Jul 31, 2020

@Wheest please help to fix the CI lint error

@tqchen added the "status: need update" label on Jul 31, 2020
@FrozenGene (Member) left a comment

The requested change is that we should support asymmetric padding, like the other compute/schedule implementations.

Resolved review threads on topi/python/topi/arm_cpu/group_conv2d.py (outdated).
# pack kernel
shape = (G, KPG // oc_bn, CPG // ic_bn,
         KH, KW, ic_bn, oc_bn)
kernel_vec = te.compute(shape,
Member:

I think we could do this in alter_op_layout for the kernel; then we won't need to schedule kernel_vec, since it will already be a plain tensor when we do inference. You could refer to the spatial pack of Arm conv2d.
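As a rough sketch of this suggestion, modelled on the existing x86 NCHWc alter-op hook: the kernel is re-declared in a packed layout inside the alter-op hook, so the layout_transform on the weight is constant-folded at compile time and no kernel_vec stage is needed in the schedule. The function name, blocking factors, and layout strings below are illustrative assumptions, not the PR's actual code; the open problem discussed later in this thread is that a GSPC-style layout with a group axis is not yet known to the C++ side.

from tvm import relay

def _alter_group_conv2d_layout(attrs, inputs, tinfos, out_type):
    """Hypothetical sketch; real logic would live in the target's registered
    _alter_conv2d_layout hook."""
    ic_bn, oc_bn = 8, 8  # illustrative blocking factors
    new_attrs = {k: attrs[k] for k in attrs.keys()}
    new_attrs["data_layout"] = "NCHW%dc" % ic_bn
    new_attrs["kernel_layout"] = "OIHW%di%do" % (ic_bn, oc_bn)
    new_attrs["out_layout"] = "NCHW%dc" % oc_bn
    # Relay inserts layout_transform ops on the inputs; the weight transform is
    # folded ahead of time, so the compute receives a pre-packed kernel tensor.
    return relay.nn.contrib_conv2d_nchwc(*inputs, **new_attrs)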

Contributor Author:

Regarding the issue I was having, I think I misunderstood and was trying to alter the shape of both the data and the kernel ahead of time. I am making the fix now.

Contributor Author:

Hi @FrozenGene, my final blocker to completing this PR is adding the GSPC kernel layout to the C++ runtime. I've got the Python side working; however, alter_op requires the layout to be available in the C++ runtime, and I'm unsure how to do this.

See this post on the forums where I explain my issue. Would you be able to give any pointers, please?

Member:

@merrymercy @minminsun does this issue have the same cause as our auto-scheduler layout rewrite issue? Do you have any ideas about it?

Contributor Author:

Hey @FrozenGene @merrymercy @minminsun, any thoughts on adding custom kernel layouts to the C++ runtime, so that alter_op can have kernels reshaped ahead of time?

Contributor Author:

If this issue cannot be easily resolved, could the pull request be merged without alter_op_layout, with a new PR made to track the issue and include my attempts so far at solving it?

Member:

Yes. Sorry for delaying this so long. We could open an issue to track it.

@Wheest (Contributor Author) commented Mar 18, 2021

Thanks. I believe I have responded to the issues, and will see whether the latest commit passes the PR checks. For the nn.utils.infer_pad issue, it's not being used by any code in TVM, but I will be starting a discussion on the forums about it, as it could be useful in future.

# schedule data
if DOPAD:
    s[A0].compute_inline()
groups, batch, ic_chunk, ih, ic_block, iw = s[A1].op.axis
Member:

Maybe we should have a compute_at for A1 at CC, not just parallel?


# schedule data
if DOPAD:
    s[A0].compute_inline()
Member:

Consider using compute_at and vectorizing the data load. According to the experiment in #6095 (comment), this should give better performance than compute_inline directly.
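To illustrate the suggestion, here is a tiny self-contained TE example, a toy pad-and-copy pipeline with made-up shapes rather than the PR's actual stages: the padded data is attached under an outer loop of its consumer with compute_at and its innermost axis is vectorized, instead of being inlined.

import tvm
from tvm import te, topi

A = te.placeholder((1, 32, 56, 56), name="A")
Apad = topi.nn.pad(A, [0, 0, 1, 1], [0, 0, 1, 1], name="Apad")
B = te.compute(Apad.shape, lambda n, c, h, w: Apad[n, c, h, w] * 2.0, name="B")

s = te.create_schedule(B.op)
n, c, h, w = s[B].op.axis
s[Apad].compute_at(s[B], h)   # materialize the padded rows under B's h loop
_, _, _, wi = s[Apad].op.axis
s[Apad].vectorize(wi)         # vectorized data load along the innermost axis
print(tvm.lower(s, [A, B], simple_mode=True))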

@ZihengJiang (Contributor) commented:

Ping for update @FrozenGene @Wheest

@Wheest (Contributor Author) commented Sep 16, 2020

@ZihengJiang I have fixed the linting errors.

I have added asymmetric padding support; however, this isn't consistent across the rest of TOPI. I have added a new workload type in python/tvm/topi/nn/conv2d.py that could be adopted across other conv2d implementations (see the sketch after this comment).

I have been looking at the suggested schedule improvements, but have not been able to get any speedups so far.

I have not yet figured out how to do the kernel packing step in alter_op_layout, but I think I can do it when I have some time next week.
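For illustration only, here is a sketch of what a workload record carrying asymmetric padding could look like. The field names are assumptions, not necessarily those added in python/tvm/topi/nn/conv2d.py.

from collections import namedtuple

# Hypothetical workload with four padding fields (top/left/bottom/right)
# instead of the older symmetric hpad/wpad pair.
Workload = namedtuple("Workload", [
    "in_dtype", "out_dtype", "height", "width",
    "in_filter", "groups", "out_filter",
    "kernel_h", "kernel_w",
    "padt", "padl", "padb", "padr",
    "stride_h", "stride_w",
])

wkl = Workload("float32", "float32", 56, 56, 64, 8, 64, 3, 3, 1, 1, 1, 1, 1, 1)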

@ZihengJiang (Contributor) commented:

Sounds good! Just ping the reviewers when you feel it is ready to be reviewed again. Thanks for the work!

@tqchen changed the base branch from master to main on October 11, 2020 at 18:26
@Wheest (Contributor Author) commented Dec 21, 2020

Hello there, I'm updating this pull request to bring it up to date with the latest main branch.

In terms of things remaining to do:

In the interests of more transparent development, here's part of my test suite.

@FrozenGene (Member) commented:

> Hello there, I'm updating this pull request to bring it up to date with the latest main branch.
> In terms of things remaining to do:
> In the interests of more transparent development, here's part of my test suite.

Sorry, @Wheest, I just noticed your mention. Could you give me some time to review your code? I can allocate some next week.

@Wheest (Contributor Author) commented Dec 26, 2020

Sure thing, thanks for reviewing! You can see my ongoing PR for the asymmetric padding here.

@merrymercy (Member) commented Dec 26, 2020

We can also try the auto-scheduler; it should deliver very good performance for group convolution. The benefit is that we don't need to worry about schedule templates anymore. The non-trivial part of the implementation is enabling layout rewrite for group conv2d, which is critical for performance. If you are willing to try it, I can give more instructions.

@Wheest (Contributor Author) commented Dec 26, 2020

Hi @merrymercy, I'm just reading about the auto-scheduler now; is the RFC the best learning resource? How does it compare to Ansor?

I'm willing to give it a try, though perhaps it would be better to get this PR merged first (since it provides an immediate and clear performance improvement over the existing solution) and add improvements in another one?

@merrymercy (Member) commented Dec 26, 2020

The RFC you mentioned is very old and that project is deprecated. We started another project, "Ansor", to bring an auto-scheduler to TVM. Ansor is integrated as the tvm.auto_scheduler package in the current code base. "Ansor" and "auto-scheduler" in the current code base are the same thing, just different names. You can see the RFC for the Ansor auto-scheduler here, but that RFC is also slightly old. The best way to learn it is these tutorials.

I totally agree with you that we can merge this PR first.

@FrozenGene (Member) commented Dec 29, 2020

> Pack in alter_op_layout for kernel: have been working on this, but have an issue. My data is being passed to my group_conv2d_NCHWc.x86 in the conv2d_NCHWc format (5D input data), rather than the GSPC format (6D input data), despite my changes to the x86 _alter_conv2d_layout. See this branch. Some guidance or pointers would be appreciated @FrozenGene.

@Wheest I don't quite understand your problem. Inside this pass we should only handle the kernel layout; it is not related to the data layout. Could you explain it a bit more clearly?
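For reference, these are the two packed data layouts being contrasted in the quoted problem, with illustrative sizes (N=1, C=64, H=W=56, groups G=4, blocking factor ic_bn=8). The 6D axis order follows the schedule snippet earlier in the thread (groups, batch, ic_chunk, ih, ic_block, iw).

# conv2d_NCHWc packed data: 5D, no group axis.
nchwc_shape = (1, 64 // 8, 56, 56, 8)      # (N, C // ic_bn, H, W, ic_bn)

# GSPC packed data: 6D, with an explicit group axis and channels per group.
cpg = 64 // 4                              # channels per group
gspc_shape = (4, 1, cpg // 8, 56, 8, 56)   # (G, N, CPG // ic_bn, IH, ic_bn, IW)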

Comment on lines 178 to 214
if not is_auto_scheduler_enabled():
    logger.warning("group_conv2d is not optimized for x86 with autotvm.")
strategy.add_implementation(
    wrap_compute_conv2d(topi.nn.group_conv2d_nchw, has_groups=True),
    wrap_topi_schedule(topi.generic.schedule_group_conv2d_nchw),
    name="group_conv2d_nchw.generic",
)
@FrozenGene (Member) commented Dec 29, 2020

We should compare your implementation's performance with the auto-scheduler. We shouldn't delete the auto-scheduler support outright.

Contributor Author:

What kind of strategy logic would include both? Having this code on an else branch of if not is_auto_scheduler_enabled():? Or something else?

Member:

if not is_auto_scheduler_enabled(): ... We prefer the auto-scheduler to be the default strategy.

Contributor Author:

The current code should now be able to run with the auto-scheduler when "relay.backend.use_auto_scheduler": True is set, and otherwise uses my schedule.

Resolved review threads on python/tvm/relay/op/strategy/x86.py (outdated).

# no stride and padding info here
padding = infer_pad(data, data_pad)
hpad, wpad = padding
Member:

Should this be pad_t, pad_l, pad_b, pad_r?

Contributor Author:

We only get hpad and wpad from nn.utils.infer_pad, since this information can't be recovered from data and data_pad alone. This has no effect on the schedule, since we only use it to derive the DOPAD variable.
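As a small self-contained illustration of what the snippet above does (shapes are made up): infer_pad only compares the padded and unpadded shapes, and the result merely gates whether a separate padding stage exists in the schedule.

import tvm
from tvm import te, topi
from tvm.topi.nn.utils import infer_pad

data = te.placeholder((1, 32, 56, 56), name="data")
data_pad = topi.nn.pad(data, [0, 0, 1, 1], [0, 0, 1, 1], name="data_pad")

hpad, wpad = infer_pad(data, data_pad)  # -> (1, 1): symmetric pad per spatial dim
DOPAD = hpad != 0 or wpad != 0          # only used to decide whether a pad stage exists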

Member:

I think we should fix the issue with nn.utils.infer_pad, since we now support asymmetric padding. The fact that nn.utils.infer_pad only returns two values is for historical reasons. However, as it doesn't affect performance or correctness, it is up to you whether to support it in this PR.

@Wheest (Contributor Author) commented Mar 18, 2021

Thanks, I think I will sort out nn.utils.infer_pad in a separate PR this weekend, since there are a couple of design questions around it that need to be answered.

The conv2d codebase has changed a lot since I last looked at it. I did some digging and found that the non-grouped version of spatial pack convolution has stopped using nn.utils.infer_pad, so I have updated the GSPC code to follow the same convention.
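For illustration, this is my understanding of the convention the non-grouped spatial pack now follows (not code from this PR): the four asymmetric pads are derived from the op's padding attribute and kernel size with get_pad_tuple, rather than re-inferred from the data/data_pad shapes.

from tvm.topi.nn.utils import get_pad_tuple

# A padding attribute of (1, 1) and a 3x3 kernel, purely illustrative.
pad_top, pad_left, pad_down, pad_right = get_pad_tuple((1, 1), (3, 3))
DOPAD = any(p != 0 for p in (pad_top, pad_left, pad_down, pad_right))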

Comment on lines 208 to 221
     if not is_auto_scheduler_enabled():
         logger.warning("group_conv2d is not optimized for x86 with autotvm.")
     strategy.add_implementation(
-        wrap_compute_conv2d(topi.nn.group_conv2d_nchw, has_groups=True),
-        wrap_topi_schedule(topi.generic.schedule_group_conv2d_nchw),
-        name="group_conv2d_nchw.generic",
+        wrap_compute_conv2d(topi.x86.group_conv2d_nchw, has_groups=True),
+        wrap_topi_schedule(topi.x86.schedule_group_conv2d_nchw),
+        name="group_conv2d_nchw.x86",
     )
 elif layout == "NHWC":
     assert kernel_layout == "HWIO"
     if not is_auto_scheduler_enabled():
         logger.warning("group_conv2d is not optimized for x86 with autotvm.")
-    strategy.add_implementation(
-        wrap_compute_conv2d(topi.nn.group_conv2d_nhwc, has_groups=True),
-        wrap_topi_schedule(topi.generic.schedule_group_conv2d_nhwc),
-        name="group_conv2d_nhwc.generic",
-    )
+        strategy.add_implementation(
+            wrap_compute_conv2d(topi.nn.group_conv2d_nhwc, has_groups=True),
+            wrap_topi_schedule(topi.generic.schedule_group_conv2d_nhwc),
+            name="group_conv2d_nhwc.generic",
+        )
Member:

The fix has a bug when we enable the auto-scheduler: with the auto-scheduler enabled, there is no compute declaration left registered for it. Please fix it.

Contributor Author:

The bug has been fixed by removing the indentation.

@FrozenGene (Member) commented:

@Wheest It seems you have a wrong rebase. Please refer to this documentation: https://tvm.apache.org/docs/contribute/git_howto.html

@Wheest (Contributor Author) commented Mar 24, 2021

> @Wheest It seems you have a wrong rebase. Please refer to this documentation: https://tvm.apache.org/docs/contribute/git_howto.html

I believe I have fixed the rebase/commit problems, and will try to avoid these in future for a cleaner PR.

@FrozenGene merged commit 7130e80 into apache:main on Mar 25, 2021
@FrozenGene (Member) commented:

Thanks @Wheest, merged now.

@Wheest (Contributor Author) commented Mar 25, 2021

Thanks for helping me through my first PR for a major OSS project, @FrozenGene. I have learned a lot, and I hope to be a more productive member of the community in future.

@tqchen (Member) commented Mar 25, 2021

Thank you @Wheest! And thank you @FrozenGene for shepherding.

@tqchen added the "status: accepted" label and removed the "status: need review" and "status: need update" labels on Mar 25, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
* integrated with v0.8

* Rebase, and undoing accidental removal of auto scheduler NHWC support

* Added ASF license header

* Minor bug fixes

* Added asymmetric padding support
Fixed linting

* Improve linting

* Better linting, disable final linting checks

* Fixed final linting errors (figured out how to run lint tests locally)

* fixing linter formatting part 1

* fixing linter formatting part 2

* fixing linter formatting part 3

* Update conv2d.py

Fixed merge issue

* Rebase, and update responding to some comments

* Fixed AutoScheduler bug for NHWC case

* removed infer_pad from GSPC

* Rebase, and undoing accidental removal of auto scheduler NHWC support

* Added ASF license header

* Minor bug fixes

* Added asymmetric padding support
Fixed linting

* Improve linting

* Better linting, disable final linting checks

* Fixed final linting errors (figured out how to run lint tests locally)

* Update conv2d.py

Fixed merge issue

* Rebase, and update responding to some comments

* Fixed AutoScheduler bug for NHWC case

* Minor fix

* Fixed removal of infer_pad to no padding

* Fixed unexpected linting error

Co-authored-by: Perry Gibson <[email protected]>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021, with the same commit message as above.