Project · ThinkerCUDA
Occupancy wins when tiles fit the SM, not when blocks look big
CUDA · C++ · HPC
Custom 3D convolution and tiled matmul kernels reached ~6× CPU throughput, but only after I stopped chasing block size and started matching tile geometry to shared memory and warp occupancy.
The hypothesis
My first kernels launched large blocks with aggressive unrolling, on the theory that a bigger launch configuration would mean more throughput. Occupancy calculators reported acceptable numbers; wall-clock time was not acceptable.
The finding
Smaller tiles that fit in shared memory, loaded with coalesced global-memory accesses, kept more warps resident and reduced global-memory round trips. Memory coalescing plus deliberate tile sizes beat a 'bigger launch config' mindset.
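To make the coalescing half of that concrete, here is a minimal host-side sketch in C++ rather than kernel code from the project. It counts how many aligned memory segments one warp touches when each thread reads one 4-byte element at a given stride. The function name `segments_touched`, the 128-byte segment size, and the 32-thread warp are illustrative assumptions; real hardware transaction granularity varies by architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Simplified model (assumption): global memory is served in aligned
// segments of `segment_bytes`, and each thread reads one 4-byte element.
constexpr std::size_t kElemBytes = 4;

// Number of distinct segments touched when thread t of a warp reads
// element index base_elem + t * stride_elems.
int segments_touched(std::size_t base_elem, std::size_t stride_elems,
                     int warp_size = 32, std::size_t segment_bytes = 128) {
    assert(warp_size > 0 && segment_bytes > 0);
    std::set<std::size_t> segments;
    for (int t = 0; t < warp_size; ++t) {
        const std::size_t elem = base_elem + static_cast<std::size_t>(t) * stride_elems;
        segments.insert(elem * kElemBytes / segment_bytes);
    }
    return static_cast<int>(segments.size());
}
```

Under this model, `segments_touched(0, 1)` (adjacent threads read adjacent elements) touches 1 segment, while `segments_touched(0, 1024)` (each thread strides a full row of a hypothetical 1024-wide matrix) touches 32. That gap is the cost a tile load avoids when its threads walk along the contiguous dimension.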
Course tie-in
Applied Parallel Programming framed this cleanly: the machine rewards locality. The project made that concrete. Occupancy is a constraint problem to solve, not a dial you max out.
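The constraint view of occupancy can be sketched as simple arithmetic: the number of blocks resident on one SM is the minimum over several per-SM limits. The C++ below is a rough host-side model, not the project's code and not a replacement for a real occupancy calculator. The struct names, the function names, and every limit used with it are hypothetical, and it ignores allocation granularity.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM resource limits (assumed values are supplied by the caller).
struct SmLimits {
    int max_threads;   // resident threads per SM
    int max_blocks;    // resident blocks per SM
    int shared_bytes;  // shared memory per SM
    int registers;     // 32-bit registers per SM
};

// Hypothetical per-block resource usage of one kernel launch configuration.
struct BlockConfig {
    int threads;
    int shared_bytes;
    int regs_per_thread;
};

// Resident blocks per SM: the tightest of the four constraints wins.
int resident_blocks(const SmLimits& sm, const BlockConfig& b) {
    assert(b.threads > 0 && b.regs_per_thread > 0 && b.shared_bytes >= 0);
    const int by_threads = sm.max_threads / b.threads;
    const int by_shared =
        b.shared_bytes > 0 ? sm.shared_bytes / b.shared_bytes : sm.max_blocks;
    const int by_regs = sm.registers / (b.threads * b.regs_per_thread);
    return std::min({sm.max_blocks, by_threads, by_shared, by_regs});
}

// Resident warps per SM, assuming 32 threads per warp.
int resident_warps(const SmLimits& sm, const BlockConfig& b) {
    const int warps_per_block = (b.threads + 31) / 32;
    return resident_blocks(sm, b) * warps_per_block;
}
```

With an assumed SM of 2048 threads, 32 blocks, 64 KiB of shared memory, and 65536 registers, a 1024-thread block using 40 KiB of shared memory at 32 registers per thread is limited by shared memory to 1 resident block (32 warps). A 256-thread block using 8 KiB at the same register count fits 8 blocks (64 warps). These numbers are illustrative only; they are not the configurations measured in the project.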