
Improve NHWC depthwise convolution for AArch64 #6095

Merged: 11 commits merged into apache:master on Aug 13, 2020

Conversation

@giuseros (Contributor) commented Jul 20, 2020

High-level description of this contribution

We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise convolution for NHWC layouts on AArch64 architectures.

The schedule lives in topi/python/topi/arm_cpu/depthwise_conv2d.py, while the strategy is registered in python/tvm/relay/op/strategy/arm_cpu.py.

This contribution is based on the following RFC: https://discuss.tvm.ai/t/rfc-improve-depthwise-convolution-for-nhwc-layouts-on-aarch64-targets/7360
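
For illustration, a minimal sketch (not part of this PR) of how the schedule would be exercised from Relay: a small NHWC depthwise conv2d built for an AArch64 CPU target, so the strategy registered in python/tvm/relay/op/strategy/arm_cpu.py can pick the new schedule. The shapes, padding, and target triple are illustrative assumptions.

```python
# Hedged sketch: build a small NHWC depthwise conv2d for an AArch64 CPU target.
# Shapes, padding, and the target triple below are illustrative assumptions.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 56, 56, 64), dtype="float32")    # NHWC input
weight = relay.var("weight", shape=(3, 3, 64, 1), dtype="float32")  # HWOI depthwise kernel
conv = relay.nn.conv2d(
    data, weight,
    channels=64, groups=64, kernel_size=(3, 3), padding=(1, 1),
    data_layout="NHWC", kernel_layout="HWOI",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# arm_cpu target so the arm_cpu strategy (and this schedule) is selected.
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)
```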

@giuseros (Contributor, Author) commented:

cc @u99127 @anijain2305 @FrozenGene

@giuseros (Contributor, Author) commented:

Hi @FrozenGene,
Before introducing tuning knobs, I wanted to first run an analysis to find the minimum set of tuning parameters that delivers the best performance.

The aim is to reduce the tuning time. The point is that we are sometimes constrained by the number of registers available on AArch64, so trying out different splits might only increase the tuning time without giving any benefit.

So the idea was to have a "default" schedule which mimics the ACL implementation, and then introduce (the minimal set of) tuning knobs plus tensorization to speed things up.

What do you think? If you want to add tuning knobs in this PR, I will try to do the tuning analysis today.
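
As an illustration of the kind of minimal knob set being discussed, here is a hedged AutoTVM-style sketch; the knob names and split axes are hypothetical and this is not the PR's actual template, only the define_split/define_knob calls are the standard AutoTVM API.

```python
# Hedged sketch (not the PR's template): a schedule function with a
# deliberately small AutoTVM knob set.  Knob names and axes are hypothetical.
def sketch_depthwise_nhwc_schedule(cfg, s, out):
    # cfg is the AutoTVM config object passed into a registered schedule
    # template; s is the te schedule and out is the conv output tensor.
    n, h, w, c = s[out].op.axis

    # A single split over the channel axis is often enough to vectorize;
    # more splits mostly grow tuning time, because register pressure on
    # AArch64 limits how many tilings are actually useful.
    cfg.define_split("tile_c", c, num_outputs=2)
    co, ci = cfg["tile_c"].apply(s, out, c)
    s[out].vectorize(ci)

    # One boolean knob instead of a full unroll search.
    cfg.define_knob("unroll_w", [0, 1])
    if cfg["unroll_w"].val:
        s[out].unroll(w)
    return s
```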

@FrozenGene (Member) commented Jul 21, 2020

> ACL implementation

Hi @giuseros, thanks for the work. I fully understand your purpose and the smooth development path. Since this schedule will be the default NHWC depthwise convolution, my opinion is that we should try to achieve as good a performance as we can. Notably, I don't mean we must reach ACL-level ultimate performance before we can merge; optimization is not a one-shot deal. But I think we could enable AutoTVM here to help us achieve better performance, and it is worth introducing in this PR:

  • This schedule will be applied to both arm32 and arm64, so we shouldn't only consider arm64. AutoTVM splits could help us avoid this issue.

  • A tuning knob for compute_at (especially for data_pad) could help us solve the parallelism/compute-locality trade-off (we cannot assume the kernel only runs on a single core), as sketched after this comment. See Figure 2 of http://people.csail.mit.edu/jrk/halide-pldi13.pdf for more detail.

I agree we should reduce the number of tuning knobs and improve the tuning-time experience, but if a knob helps us improve performance, I think we should introduce it; otherwise we can leave it out.
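
To make the second point concrete, a hedged sketch of what such a compute_at knob for the padded input could look like; the names and the axis choice are illustrative, not this PR's code.

```python
# Hedged sketch: a knob that decides where the padded input is computed,
# trading locality (inline / compute_at) against redundant work when the
# kernel is parallelized across cores.  Names are illustrative.
def sketch_data_pad_knob(cfg, s, data_pad, conv):
    n, h, w, c = s[conv].op.axis
    s[conv].parallel(h)

    cfg.define_knob("data_pad_strategy", ["inline", "compute_at", "root"])
    strategy = cfg["data_pad_strategy"].val
    if strategy == "inline":
        # No intermediate buffer, but padding is recomputed at every use.
        s[data_pad].compute_inline()
    elif strategy == "compute_at":
        # Keep the padded tile close to where each row of output is computed.
        s[data_pad].compute_at(s[conv], h)
    # "root" leaves data_pad at root scope: padded once, shared by all cores.
    return s
```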

@giuseros (Contributor, Author) commented:

Hi @FrozenGene,

Let me thank you for the review and the pointers (the Halide paper is quite interesting)!

I tried most of your suggestions, and all I got was a tiny improvement on MobileNetV2 but a significant increase in tuning time.

Since this looks like it will take more time and more analysis, I would prefer to take this PR in as is; I can follow up with further improvements in the future.

@giuseros (Contributor, Author) commented:

Hi @tqchen,

Is this an issue with my changes or with the CI? It seems to point to an import in a particular file, but I am not able to see anything wrong with that import.

@FrozenGene (Member) commented:

> I tried most of your suggestions, and all I got was a tiny improvement on MobileNetV2 but a significant increase in tuning time.
>
> Since this looks like it will take more time and more analysis, I would prefer to take this PR in as is; I can follow up with further improvements in the future.

I agree. We could limit the tuning knobs if we find they have no effect.

@FrozenGene (Member) commented:

> Is this an issue with my changes or with the CI? It seems to point to an import in a particular file, but I am not able to see anything wrong with that import.

This seems unrelated to your change; others have hit this CI error too.

@giuseros (Contributor, Author) commented:

Hi @FrozenGene,
These are the final changes to this PR.

  • Introducing the compute_at knobs forces us to use the xgb_knob tuner, which does not support locate_cache annotations, so I switched to a custom knob (see the tuner sketch below).

  • I also noticed that we were not legalizing depthwise, which means we were running pooling, reductions, etc., in order to compute the offset contribution. Legalizing depthwise gives a 2x boost (for quantized).

If this isn't approved by tonight, I will turn it into a draft and pick it up after I come back from holidays (i.e., in 15 days).
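
For context on the first bullet, here is a hedged sketch of how a knob-feature XGBoost tuner is normally set up in AutoTVM; `mod`, `params`, and `target` are assumed to exist, the log file name is illustrative, and this is not the custom knob tuner mentioned above.

```python
# Hedged sketch: running AutoTVM's XGBoost tuner with knob-level features.
# `mod`, `params`, and `target` are assumed; the log path is illustrative.
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Local measurement; use autotvm.RPCRunner instead when cross-compiling for
# a remote AArch64 board.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10),
)

for task in tasks:
    tuner = XGBTuner(task, feature_type="knob")  # knob features instead of itervar features
    tuner.tune(
        n_trial=min(200, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("depthwise_nhwc_tuning.log")],
    )
```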

@FrozenGene (Member) left a review comment:

LGTM

@anijain2305 (Contributor) left a review comment:

LGTM. Just one minor comment.

@giuseros (Contributor, Author) commented:

I probably enabled a CUDA test that was making the CI hang. I reverted the test, hoping that this was the issue.

@giuseros (Contributor, Author) commented:

@FrozenGene, @anijain2305,
This is my last commit before holidays. I enabled only the arm_cpu tests (everything passes locally). The case with dilation > 1 will be untested for now (as it is for CUDA). I hope this will make CI happy; if not, I will pick this up when I am back.

Giuseppe Rossini added 11 commits (August 12, 2020 10:39):

  • Improve NHWC depthwise convolution for aarch64 (Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af)
    We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64).
  • Add tuning knobs in depthwise schedule (Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1)
  • Introduce padding policy (Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1)
  • Vectorize padding (Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30)
  • Legalize depthwise convolution (2x improvement) and fix tuning issue (Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9)
  • Adding assert on padding (Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd)
  • Fix python linting (Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9)
  • Removing commented code (Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0)
  • Revert test file to make CI pass (Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9)
  • Enabling only arm_cpu tests (Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e)
  • Rebasing (Change-Id: Ibb23f1d4e0d0107e4e3b3571437161cdc2ee2909)
@giuseros (Contributor, Author) commented:

Hi @FrozenGene, @anijain2305,
This PR finally passed the CI. Would it be possible to merge it?

Thanks!

@FrozenGene merged commit b6b5ace into apache:master on Aug 13, 2020
@FrozenGene (Member) commented:

Thanks @giuseros. Merged now.

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Sep 2, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Sep 3, 2020