This repository is a collection of notes, diagrams, and kernels that I am compiling (no pun intended!) to better understand GPU programming. To that end, I focus mainly on implementing GPU kernels in CUDA C and Triton.
- Introduction to GPU compute for CUDA-capable GPUs. Covers parallel computing terms including kernels, streaming multiprocessors (SMs), CUDA cores, threads, warps, thread blocks, and grids.
- Introduction to GPU memory. Covers concepts including registers, L1 cache, L2 cache, shared memory, global memory, memory clock rate, memory bus width, and peak memory bandwidth.
- "Hello, World!".
- SAXPY (Single-Precision A*X Plus Y); a minimal sketch appears after this list.
- Matrix multiplication.
- Matrix multiplication with cache tiling.
- Matrix multiplication kernel where each thread computes one row of the output matrix.
- Matrix multiplication kernel where each thread computes one column of the output matrix.
- Matrix-vector multiplication kernel.
- 1D convolution.
- 1D convolution with constant memory.
- 1D convolution with tiling.
- 2D convolution.
- Sum reduction: interleaved addressing with warp divergence.
- Sum reduction: interleaved addressing with shared memory bank conflicts.
- Sum reduction: sequential addressing (see the sketch after this list).
- Sum reduction: first sum during load from global memory.
- Sum reduction: unrolling of the last warp using SIMD execution.
- Sum reduction using Cooperative Groups (CUDA 9 and above).
- Pointwise ops: ReLU.
- Pointwise ops: ReLU with shared memory.
- Program that extracts the properties of the attached CUDA device(s).
- CUDA Streams.
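As a taste of what these kernels look like, here is a minimal SAXPY sketch. It is illustrative only: the kernel name, variable names, and launch configuration are assumptions, not necessarily what the source files in this repo use.

```cuda
// Minimal SAXPY: each thread computes one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)                                     // guard threads past the end
        y[i] = a * x[i] + y[i];
}

// Hypothetical launch: one thread per element, 256 threads per block.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```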
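And here is a sketch of the sequential-addressing reduction, again with assumed names. Consecutive threads touch consecutive shared-memory addresses, which avoids the bank conflicts of the interleaved variants.

```cuda
// Per-block sum reduction with sequential addressing (sketch).
__global__ void reduce_sum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];        // dynamically sized shared memory

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;    // one load per thread
    __syncthreads();

    // Halve the active threads each step; thread tid adds the element
    // s slots away, so active threads stay contiguous (no divergence
    // within a warp until s drops below the warp size).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];         // block's partial sum
}
```

Launched as `reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n)`, each block writes one partial sum that a second pass (or the host) finishes.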
To run the CUDA scripts in this repo, you will need a host machine with a CUDA-enabled GPU and `nvcc` installed.
In general, you can compile and execute a CUDA source file as follows:

```sh
nvcc /path/to/source.cu -o /path/to/executable -run
```
For example, you can run the "Hello, World!" kernel using:

```sh
nvcc src/hello_world.cu -o hello_world -run
```
Note that `.cu` is the required file extension for CUDA-accelerated programs.
See the Makefile for a more complete list of commands you can run.
To query the resources available on your device, run:

```sh
nvcc src/device_info.cu -o device_info -run
```
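Under the hood, a device-query program like this typically calls the CUDA runtime's `cudaGetDeviceProperties`; a minimal sketch (the fields printed here are a small, illustrative subset) looks like:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);             // number of attached CUDA devices

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);  // fill in the properties struct
        printf("Device %d: %s\n", d, prop.name);
        printf("  SMs:                  %d\n", prop.multiProcessorCount);
        printf("  Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Memory bus width:     %d bits\n", prop.memoryBusWidth);
    }
    return 0;
}
```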
- CUDA C++ Programming Guide (v11.2)
- CUDA C Programming Guide (v9.1)
- Cornell Virtual Workshop: Design: GPU vs. CPU
- Cornell Virtual Workshop: Performance: GPU vs. CPU
- Cornell Virtual Workshop: Heterogeneous Applications
- Cornell Virtual Workshop: Threads and Cores Redefined
- Cornell Virtual Workshop: SIMT and Warps
- Cornell Virtual Workshop: Kernels and SMs
- Cornell Virtual Workshop: Memory Levels
- Cornell Virtual Workshop: Memory Types
- An Easy Introduction to CUDA C and C++
- Introduction to GPU programming with CUDA (C/C++)
- CUDA – Dimensions, Mapping and Indexing
- CUDA Crash Course by CoffeeBeforeArch
- From Scratch: Matrix Multiplication in CUDA
- From Scratch: Cache Tiled Matrix Multiplication in CUDA
- Programming Massively Parallel Processors (4th Edition)