
Improve NHWC depthwise convolution for AArch64 #6095

Merged: 11 commits merged into apache:master on Aug 13, 2020

Conversation

@giuseros (Contributor) commented Jul 20, 2020

High-level description of this contribution

We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise convolution for NHWC layouts on AArch64 architectures.

The schedule lives in topi/python/topi/arm_cpu/depthwise_conv2d.py, while the strategy is registered in python/tvm/relay/op/strategy/arm_cpu.py.

This contribution is based on the following RFC: https://discuss.tvm.ai/t/rfc-improve-depthwise-convolution-for-nhwc-layouts-on-aarch64-targets/7360
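
For illustration, a minimal sketch (not part of this PR) of how the schedule would be exercised from Relay: a small NHWC depthwise conv2d built for an AArch64 CPU target, so the strategy registered in python/tvm/relay/op/strategy/arm_cpu.py can pick the new schedule. The shapes, padding, and target triple are illustrative assumptions.

```python
# Hedged sketch: build a small NHWC depthwise conv2d for an AArch64 CPU target.
# Shapes, padding, and the target triple below are illustrative assumptions.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 56, 56, 64), dtype="float32")    # NHWC input
weight = relay.var("weight", shape=(3, 3, 64, 1), dtype="float32")  # HWOI depthwise kernel
conv = relay.nn.conv2d(
    data, weight,
    channels=64, groups=64, kernel_size=(3, 3), padding=(1, 1),
    data_layout="NHWC", kernel_layout="HWOI",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# arm_cpu target so the arm_cpu strategy (and this schedule) is selected.
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)
```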

@giuseros (Contributor, Author) commented:

cc @u99127 @anijain2305 @FrozenGene

@giuseros (Contributor, Author) commented:

Hi @FrozenGene,
Before introducing tuning knobs, I wanted to first run an analysis to find the minimum set of tuning parameters that delivers the best performance.

The aim is to reduce the tuning time. The point is that we are sometimes constrained by the number of registers available on AArch64, so trying out different splits might only increase the tuning time without giving any benefit.

So the idea was to have a "default" schedule which mimics the ACL implementation, and then introduce (the minimal set of) tuning knobs plus tensorization to speed things up.

What do you think? If you want to add tuning knobs in this PR, I will try to do the tuning analysis today.
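
As an illustration of the kind of minimal knob set being discussed, here is a hedged AutoTVM-style sketch; the knob names and split axes are hypothetical and this is not the PR's actual template, only the define_split/define_knob calls are the standard AutoTVM API.

```python
# Hedged sketch (not the PR's template): a schedule function with a
# deliberately small AutoTVM knob set.  Knob names and axes are hypothetical.
def sketch_depthwise_nhwc_schedule(cfg, s, out):
    # cfg is the AutoTVM config object passed into a registered schedule
    # template; s is the te schedule and out is the conv output tensor.
    n, h, w, c = s[out].op.axis

    # A single split over the channel axis is often enough to vectorize;
    # more splits mostly grow tuning time, because register pressure on
    # AArch64 limits how many tilings are actually useful.
    cfg.define_split("tile_c", c, num_outputs=2)
    co, ci = cfg["tile_c"].apply(s, out, c)
    s[out].vectorize(ci)

    # One boolean knob instead of a full unroll search.
    cfg.define_knob("unroll_w", [0, 1])
    if cfg["unroll_w"].val:
        s[out].unroll(w)
    return s
```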

@FrozenGene (Member) commented Jul 21, 2020

> ACL implementation

Hi @giuseros, thanks for the work. I fully understand your purpose and the smooth development path. Since this schedule will be the default NHWC depthwise convolution, my opinion is that we should try to achieve as good a performance as we can. Notably, I don't mean we must reach ACL-level ultimate performance before we can merge; optimization is not a one-shot deal. But I think we could enable AutoTVM here to help us achieve better performance, and it is worth introducing in this PR:

  • This schedule will be applied to both arm32 and arm64, so we shouldn't only consider arm64. AutoTVM splits could help us avoid this issue.

  • A tuning knob for compute_at (especially for data_pad) could help us solve the parallelism/compute-locality trade-off (we cannot assume the kernel only runs on a single core), as sketched after this comment. See Figure 2 of http://people.csail.mit.edu/jrk/halide-pldi13.pdf for more detail.

I agree we should reduce the number of tuning knobs and improve the tuning-time experience, but if a knob helps us improve performance, I think we should introduce it; otherwise we can leave it out.
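
To make the second point concrete, a hedged sketch of what such a compute_at knob for the padded input could look like; the names and the axis choice are illustrative, not this PR's code.

```python
# Hedged sketch: a knob that decides where the padded input is computed,
# trading locality (inline / compute_at) against redundant work when the
# kernel is parallelized across cores.  Names are illustrative.
def sketch_data_pad_knob(cfg, s, data_pad, conv):
    n, h, w, c = s[conv].op.axis
    s[conv].parallel(h)

    cfg.define_knob("data_pad_strategy", ["inline", "compute_at", "root"])
    strategy = cfg["data_pad_strategy"].val
    if strategy == "inline":
        # No intermediate buffer, but padding is recomputed at every use.
        s[data_pad].compute_inline()
    elif strategy == "compute_at":
        # Keep the padded tile close to where each row of output is computed.
        s[data_pad].compute_at(s[conv], h)
    # "root" leaves data_pad at root scope: padded once, shared by all cores.
    return s
```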

@giuseros (Contributor, Author) commented:

Hi @FrozenGene,

Let me thank you for the review and the pointers (the Halide paper is quite interesting)!

I tried most of your suggestions, and all I got was a tiny improvement on MobileNetV2 but a significant increase in tuning time.

Since this looks like it will take more time and more analysis, I would prefer to take this PR in as is; I can follow up with further improvements in the future.

@giuseros (Contributor, Author) commented:

Hi @tqchen,

Is this an issue with my changes or with the CI? It seems to point to an import in a particular file, but I am not able to see anything wrong with that import.

@FrozenGene (Member) commented:

> I tried most of your suggestions, and all I got was a tiny improvement on MobileNetV2 but a significant increase in tuning time.
>
> Since this looks like it will take more time and more analysis, I would prefer to take this PR in as is; I can follow up with further improvements in the future.

I agree. We could limit the tuning knobs if we find they have no effect.

@FrozenGene (Member) commented:

> Is this an issue with my changes or with the CI? It seems to point to an import in a particular file, but I am not able to see anything wrong with that import.

This seems unrelated to your change; others have hit this CI error too.

@giuseros (Contributor, Author) commented:

Hi @FrozenGene,
These are the final changes to this PR.

  • Introducing the compute_at knobs forces us to use the xgb_knob tuner, which does not support locate_cache annotations, so I switched to a custom knob (see the tuner sketch below).

  • I also noticed that we were not legalizing depthwise, which means we were running pooling, reductions, etc., in order to compute the offset contribution. Legalizing depthwise gives a 2x boost (for quantized).

If this isn't approved by tonight, I will turn it into a draft and pick it up after I come back from holidays (i.e., in 15 days).
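
For context on the first bullet, here is a hedged sketch of how a knob-feature XGBoost tuner is normally set up in AutoTVM; `mod`, `params`, and `target` are assumed to exist, the log file name is illustrative, and this is not the custom knob tuner mentioned above.

```python
# Hedged sketch: running AutoTVM's XGBoost tuner with knob-level features.
# `mod`, `params`, and `target` are assumed; the log path is illustrative.
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Local measurement; use autotvm.RPCRunner instead when cross-compiling for
# a remote AArch64 board.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10),
)

for task in tasks:
    tuner = XGBTuner(task, feature_type="knob")  # knob features instead of itervar features
    tuner.tune(
        n_trial=min(200, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("depthwise_nhwc_tuning.log")],
    )
```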

@FrozenGene (Member) left a review comment:

LGTM

@anijain2305 (Contributor) left a review comment:

LGTM. Just one minor comment.

@giuseros (Contributor, Author) commented:

I probably enabled a CUDA test that was making the CI hang. I reverted the test, hoping that this was the issue.

@giuseros (Contributor, Author) commented:

@FrozenGene, @anijain2305,
This is my last commit before holidays. I enabled only the arm_cpu tests (everything passes locally). The case with dilation > 1 will be untested for now (as it is for CUDA). I hope this will make CI happy; if not, I will pick this up when I am back.

Giuseppe Rossini added 11 commits (August 12, 2020 10:39):

  • Improve NHWC depthwise convolution for aarch64 (Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af)
    We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64).
  • Add tuning knobs in depthwise schedule (Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1)
  • Introduce padding policy (Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1)
  • Vectorize padding (Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30)
  • Legalize depthwise convolution (2x improvement) and fix tuning issue (Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9)
  • Adding assert on padding (Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd)
  • Fix python linting (Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9)
  • Removing commented code (Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0)
  • Revert test file to make CI pass (Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9)
  • Enabling only arm_cpu tests (Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e)
  • Rebasing (Change-Id: Ibb23f1d4e0d0107e4e3b3571437161cdc2ee2909)
@giuseros (Contributor, Author) commented:

Hi @FrozenGene, @anijain2305,
This PR finally passed the CI. Would it be possible to merge it?

Thanks!

@FrozenGene merged commit b6b5ace into apache:master on Aug 13, 2020
@FrozenGene (Member) commented:

Thanks @giuseros. Merged now.

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Sep 2, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Sep 3, 2020