[CUDA] Improve adaptive and global pool schedule #8936
Merged
The current GPU adaptive pool schedule doesn't parallelize across the H and W dimensions. This is fine for spatially small inputs, such as the (1, 1024, 7, 7) workloads in ResNet, but recent architectures such as EfficientNet apply adaptive (global) pooling to large inputs. In particular, EfficientDet models from the TF2 detection zoo have workloads such as (1, 32, 378, 378). This results in an essentially sequential kernel, so EfficientDet models run extremely slowly when converted to TVM.
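For concreteness, a rough repro of such a workload might look like the sketch below. This is a hedged illustration: the shape comes from the numbers above, it assumes the TOPI CUDA `schedule_adaptive_pool` entry point (the schedule this PR touches), and the exact pipeline an EfficientDet conversion goes through may differ.

```python
import numpy as np
import tvm
from tvm import te, topi

# Problematic workload shape reported above: global pooling on a large input.
N, C, H, W = 1, 32, 378, 378
data = te.placeholder((N, C, H, W), name="data")
pool = topi.nn.adaptive_pool(data, (1, 1), pool_type="avg", layout="NCHW")

with tvm.target.Target("cuda"):
    s = topi.cuda.schedule_adaptive_pool(pool)

dev = tvm.cuda(0)
func = tvm.build(s, [data, pool], target="cuda")

a = tvm.nd.array(np.random.uniform(size=(N, C, H, W)).astype("float32"), dev)
b = tvm.nd.array(np.zeros((N, C, 1, 1), dtype="float32"), dev)
timer = func.time_evaluator(func.entry_name, dev, number=20)
print("mean time (ms):", timer(a, b).mean * 1e3)
```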
I made two modifications to the adaptive pool schedule:
1. Global pooling, where the output spatial size is (1, 1), is just a reduction over the spatial axes, equivalent to `sum(..., axis=[2, 3])`, for example. So the existing GPU reduction schedule should be used as is.
2. For a general adaptive pool with output shape `(N, C, pool_x, pool_y)`, we simply create `N * C * pool_x * pool_y` units of parallel work. Each thread computes the reduction over the corresponding input sub-window (see the sketch below).
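The second point can be illustrated with a small TE sketch. This is a simplified, hypothetical illustration of the idea rather than the exact code in this PR; the output size and split factor are arbitrary, and it assumes the avg-pool case, where `adaptive_pool` produces a sum stage followed by a divide stage.

```python
import tvm
from tvm import te, topi

N, C, H, W = 1, 32, 378, 378
data = te.placeholder((N, C, H, W), name="data")
# Non-trivial output size so the general (non-global) path is exercised.
pool = topi.nn.adaptive_pool(data, (7, 7), pool_type="avg", layout="NCHW")

s = te.create_schedule(pool.op)
pool_sum = pool.op.input_tensors[0]  # avg pool = sum stage followed by a divide stage

# Fuse all output axes into N * C * pool_x * pool_y units of parallel work
# and bind them to CUDA blocks/threads.
fused = s[pool].fuse(*s[pool].op.axis)
bx, tx = s[pool].split(fused, factor=256)
s[pool].bind(bx, te.thread_axis("blockIdx.x"))
s[pool].bind(tx, te.thread_axis("threadIdx.x"))

# Each thread computes the reduction over its corresponding input sub-window.
s[pool_sum].compute_at(s[pool], tx)

mod = tvm.build(s, [data, pool], target="cuda")
print(mod.imported_modules[0].get_source())  # inspect the generated CUDA kernel
```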
Performance results on CUDA + GeForce MX250 (laptop GPU)
All numbers are in milliseconds.