Optimization for subpixel layer on Tensor core #4523
You're right that our current handling of depth2space (and space2depth) isn't optimal. We should probably have a dedicated relay op and topi implementation that directly rearranges values rather than stacking reshapes and transposes. I can try to take a stab at implementing this in the next few days.
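For context, here is a minimal NumPy sketch contrasting the two approaches: the stacked reshape/transpose decomposition versus the single direct rearrangement a dedicated kernel could do. Function names and shapes are illustrative, not the actual topi code.

```python
import numpy as np

def depth_to_space_stacked(x, block):
    # Current approach: three separate graph ops (reshape, transpose, reshape).
    n, c, h, w = x.shape
    cc = c // (block * block)
    x = x.reshape(n, block, block, cc, h, w)
    x = x.transpose(0, 3, 4, 1, 5, 2)
    return x.reshape(n, cc, h * block, w * block)

def depth_to_space_direct(x, block):
    # A dedicated op can write each output subgrid straight from the
    # matching input channels, with no intermediate tensors.
    n, c, h, w = x.shape
    cc = c // (block * block)
    out = np.empty((n, cc, h * block, w * block), dtype=x.dtype)
    for bi in range(block):
        for bj in range(block):
            src = (bi * block + bj) * cc
            out[:, :, bi::block, bj::block] = x[:, src:src + cc, :, :]
    return out

x = np.random.rand(1, 12, 8, 8).astype("float32")
assert np.array_equal(depth_to_space_stacked(x, 2), depth_to_space_direct(x, 2))
```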
I think it is okay to use reshapes and transposes, but only if we can change their parameters when the data layout changes. Some models, like EDSR, sandwich convs between subpixel layers for 4x scale, which hurts performance even more. But if the new op can support both NCHW and NHWC, that would be nice. xD
See PR #4566 for my implementation of native sub-pixel operators.
After testing it, I am happy to let you know that there is no significant difference at all. And I even found a bug. xD Environment: TVM built on Win10 with MSVC, CUDA 10.1, tested on an RTX 2060 Super.
Maybe the problem is just that this op shouldn't be fused when running on tensor cores. I suspect the access pattern just fundamentally causes bad behavior in the kernel. Can you try changing its fusion pattern to …?
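The specific pattern value suggested here was not preserved. As one possibility, re-registering the op as opaque would keep the fusion pass from merging it into neighboring kernels. A minimal sketch, assuming TVM's Python op registry (`OpPattern.OPAQUE` is my guess, not the value from the comment):

```python
from tvm.relay.op import register_pattern, OpPattern

# Re-register depth_to_space at a higher level so this overrides the
# default TOpPattern attribute; OPAQUE ops are never fused with neighbors.
register_pattern("nn.depth_to_space", OpPattern.OPAQUE, level=11)
```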
We should decode it to str. I think we should add more test cases for importing models.
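A hypothetical sketch of the kind of fix implied (the attribute name and frontend are assumptions on my part): string attributes coming out of a protobuf-based importer often arrive as bytes and must be decoded before string comparisons.

```python
# Hypothetical: normalize a frontend attribute that may arrive as bytes.
mode = attrs.get("mode", b"DCR")
if isinstance(mode, bytes):
    mode = mode.decode("utf-8")  # compare as str from here on
```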
I cannot see any difference with … My thought is that, for every op, something (likely the GPU driver) wraps data layout transposes before and after each op, since tensor cores like NCHW. However, I remember that NHWC input for TVM is even slower than doing all the NCHW-to-NHWC transposes for tensor cores, and we also need #4456 for NHWC conv to work.
@kice The ConvertLayout pass is merged now. I think you can try converting the whole graph from NCHW to NHWC, if that's the current bottleneck for you. Currently, the conversion is supported only from NHWC -> NCHW. But we can augment the pass to do the other way round as well.
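A minimal sketch of invoking the pass, assuming the dict-based API from later TVM releases (the version current at the time of this thread took a single layout string; `mod` is an existing relay.IRModule):

```python
import tvm
from tvm import relay

# Ask ConvertLayout to rewrite conv2d (and dependent ops) to NHWC,
# letting TVM pick a matching kernel layout.
desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
seq = tvm.transform.Sequential([
    relay.transform.ConvertLayout(desired_layouts),
    relay.transform.FoldConstant(),  # clean up layout-transform chains
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
```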
@kice Could you confirm whether ConvertLayout helps to reduce the overhead?
ConvertLayout only supports NHWC (TF) -> NCHW (TVM), but I used MXNet to do the layout conversion manually, and it turned out even slower. I think it has something to do with the NVIDIA driver; I can still see layout conversions in nvprof.
https://docs.tvm.ai/dev/convert_layout.html
Please use the above tutorial to add support for NCHW-to-NHWC layout conversion.
Closing for now, as depth2space is implemented. Please feel free to open new threads or create discussions on the forum.
I found that the Depth_to_Space layer spends too much time changing data layout (NHWC <-> NCHW) when using tensor cores; the transposes take up to 25% of the run time.
Is it possible to reduce this kind of unnecessary data manipulation, e.g. by combining the reshapes and/or transposes into one op?
A sample network:

Then it will do:
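The original snippets are elided above. As a hypothetical stand-in, a 2x subpixel stage looks like this in Relay; without a native op, the shuffle itself lowers to a reshape -> transpose -> reshape chain, with extra NHWC <-> NCHW transposes wrapped around the tensor-core conv:

```python
from tvm import relay

# Hypothetical 2x subpixel stage: a conv producing block^2 * C channels,
# followed by depth-to-space (shapes are illustrative).
x = relay.var("x", shape=(1, 3, 64, 64), dtype="float16")
w = relay.var("w", shape=(12, 3, 3, 3), dtype="float16")
y = relay.nn.conv2d(x, w, padding=(1, 1))      # (1, 12, 64, 64)
y = relay.nn.depth_to_space(y, block_size=2)   # (1, 3, 128, 128)
# Without the native op, the last line would instead be:
#   reshape -> transpose -> reshape
# and nvprof additionally shows NCHW <-> NHWC transposes around the conv.
func = relay.Function([x, w], y)
```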
I think the nchwToNhwc conversion is done automatically by CUDA; maybe converting the whole model to NHWC before using tensor cores would be a better choice.
Or add features like those in PR #4335.