This code implements fast CUDA kernels for DNN inference, targeting the convolution layers and residual blocks in ResNet. Each kernel fuses three stages into a single pass:
- Convolution
- Batch Normalization (BN + Scale)
- Activation (ReLU)
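As a rough illustration of the fusion idea (a minimal sketch, not the repo's actual kernels; all names below are invented for the example): at inference time BN reduces to a per-channel scale and bias, so it can be applied together with ReLU in the same pass that touches the convolution output, avoiding two extra round trips through global memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical fused epilogue: applies inference-time BN, folded into a
// per-channel scale = gamma / sqrt(var + eps) and bias = beta - mean * scale,
// plus ReLU, directly to convolution output stored in NCHW layout.
__global__ void bn_relu_epilogue(float* conv_out, const float* scale,
                                 const float* bias, int C, int HW, int total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total) {
        int c = (i / HW) % C;                        // channel of element i
        float v = conv_out[i] * scale[c] + bias[c];  // folded BN + Scale
        conv_out[i] = v > 0.0f ? v : 0.0f;           // ReLU
    }
}

int main() {
    const int N = 1, C = 2, H = 4, W = 4, total = N * C * H * W;
    float *x, *s, *b;
    cudaMalloc(&x, total * sizeof(float));
    cudaMalloc(&s, C * sizeof(float));
    cudaMalloc(&b, C * sizeof(float));
    // ... fill x with convolution results and s / b with folded BN parameters ...
    bn_relu_epilogue<<<(total + 255) / 256, 256>>>(x, s, b, C, H * W, total);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(s); cudaFree(b);
    return 0;
}
```

The actual kernels go further and apply this epilogue inside the convolution itself, as described in the technical report.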
For implementation details, please refer to the technical report included in this repo. The Winograd algorithm is used for the 3×3 convolution kernels.
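For reference, the standard F(2×2, 3×3) Winograd transform (Lavin & Gray) computes each 2×2 output tile $Y$ from a 4×4 input tile $d$ and the 3×3 filter $g$ as

$$
Y = A^\top \left[ (G\,g\,G^\top) \odot (B^\top d\,B) \right] A,
$$

where $\odot$ is elementwise multiplication and

$$
B^\top = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix},\quad
G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac12 & \tfrac12 & \tfrac12 \\ \tfrac12 & -\tfrac12 & \tfrac12 \\ 0 & 0 & 1 \end{bmatrix},\quad
A^\top = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}.
$$

This cuts the multiplications per 2×2 output tile from 36 (direct convolution) to 16; the specific tile size and data layout used by these kernels are described in the technical report.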
To build and run:

```sh
mkdir data
python data_generator.py
make
./Test 0
```
- Set parameters in `data_generator.py`.
- Run all 6 test cases by changing the argument from 0 to 5: `./Test 0` through `./Test 5`.
The timings below compare against cuDNN; column headers such as "128 / 128" give input / output channel counts.

Kernels | Operations | 128 / 128 | 256 / 256
---|---|---|---
cuDNN | Gemm + BN + ReLU | 214 µs | 384 µs
cuDNN | Winograd + BN + ReLU | 95 µs | 155 µs
Our Kernel | Winograd + BN + ReLU | 59 µs | 117 µs
Kernels | 512 / 128 | 128 / 512 | 1024 / 256 | 256 / 1024
---|---|---|---|---
Operations | Gemm + BN + ReLU | Gemm + BN | Gemm + BN + ReLU | Gemm + BN + ReLU
cuDNN | 119 µs | 115 µs | 219 µs | 214 µs
Our Kernel | 58 µs | 55 µs | 186 µs | 181 µs