[CUDA] Improve adaptive and global pool schedule #8936
Merged
The current GPU adaptive pool schedule doesn't parallelize across the H and W dimensions. This is fine for spatially small inputs, such as the (1, 1024, 7, 7) workloads in ResNet, but recent architectures such as EfficientNet apply adaptive (global) pooling to large inputs. In particular, EfficientDet models from the TF2 detection zoo have workloads such as (1, 32, 378, 378). This results in an essentially sequential kernel, so EfficientDet models run extremely slowly when converted to TVM.
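For concreteness, a rough repro of such a workload might look like the sketch below. This is a hedged illustration: the shape comes from the numbers above, it assumes the TOPI CUDA `schedule_adaptive_pool` entry point (the schedule this PR touches), and the exact pipeline an EfficientDet conversion goes through may differ.

```python
import numpy as np
import tvm
from tvm import te, topi

# Problematic workload shape reported above: global pooling on a large input.
N, C, H, W = 1, 32, 378, 378
data = te.placeholder((N, C, H, W), name="data")
pool = topi.nn.adaptive_pool(data, (1, 1), pool_type="avg", layout="NCHW")

with tvm.target.Target("cuda"):
    s = topi.cuda.schedule_adaptive_pool(pool)

dev = tvm.cuda(0)
func = tvm.build(s, [data, pool], target="cuda")

a = tvm.nd.array(np.random.uniform(size=(N, C, H, W)).astype("float32"), dev)
b = tvm.nd.array(np.zeros((N, C, 1, 1), dtype="float32"), dev)
timer = func.time_evaluator(func.entry_name, dev, number=20)
print("mean time (ms):", timer(a, b).mean * 1e3)
```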
I made two modifications to the adaptive pool schedule:
1. Global pooling, where the output spatial size is (1, 1), is just a reduction over the spatial axes, equivalent to `sum(..., axis=[2, 3])`, for example. So the existing GPU reduction schedule should be used as is.
2. For a general adaptive pool with output shape `(N, C, pool_x, pool_y)`, we simply create `N * C * pool_x * pool_y` units of parallel work. Each thread computes the reduction over the corresponding input sub-window (see the sketch below).
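The second point can be illustrated with a small TE sketch. This is a simplified, hypothetical illustration of the idea rather than the exact code in this PR; the output size and split factor are arbitrary, and it assumes the avg-pool case, where `adaptive_pool` produces a sum stage followed by a divide stage.

```python
import tvm
from tvm import te, topi

N, C, H, W = 1, 32, 378, 378
data = te.placeholder((N, C, H, W), name="data")
# Non-trivial output size so the general (non-global) path is exercised.
pool = topi.nn.adaptive_pool(data, (7, 7), pool_type="avg", layout="NCHW")

s = te.create_schedule(pool.op)
pool_sum = pool.op.input_tensors[0]  # avg pool = sum stage followed by a divide stage

# Fuse all output axes into N * C * pool_x * pool_y units of parallel work
# and bind them to CUDA blocks/threads.
fused = s[pool].fuse(*s[pool].op.axis)
bx, tx = s[pool].split(fused, factor=256)
s[pool].bind(bx, te.thread_axis("blockIdx.x"))
s[pool].bind(tx, te.thread_axis("threadIdx.x"))

# Each thread computes the reduction over its corresponding input sub-window.
s[pool_sum].compute_at(s[pool], tx)

mod = tvm.build(s, [data, pool], target="cuda")
print(mod.imported_modules[0].get_source())  # inspect the generated CUDA kernel
```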
Performance results on CUDA + GeForce MX250 (laptop GPU)
All numbers are in milliseconds.