Project · ThinkerCUDA

Occupancy wins when tiles fit the SM, not when blocks look big

CUDA · C++ · HPC

Custom 3D convolution and tiled matmul kernels reached ~6× CPU throughput, but only after I stopped chasing block size and started matching tile geometry to shared memory and warp occupancy.

The hypothesis

My first kernels launched large blocks with aggressive unrolling, on the assumption that a bigger launch configuration meant more work in flight. The occupancy calculator's numbers looked acceptable; wall-clock time did not.

The finding

Smaller tiles that fit in shared memory, loaded with coalesced accesses, kept more warps resident and reduced global-memory round trips. Memory coalescing plus deliberate tile sizes beat a 'bigger launch config' mindset.
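To make the "more warps resident" point concrete, here is a minimal host-side sketch of the arithmetic, not the project's actual code. It assumes a square tile × tile thread block that stages two float tiles in shared memory, and it ignores register pressure entirely. The SmLimits values and the function names are hypothetical, chosen only for illustration; real limits depend on the GPU's compute capability.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative per-SM limits (assumed values, not taken from the project).
struct SmLimits {
    int max_threads = 1536;        // resident threads per SM
    int max_blocks = 16;           // resident blocks per SM
    int shared_mem_bytes = 49152;  // shared memory usable per SM (48 KiB)
    int warp_size = 32;
};

// Blocks that can be resident at once: the tightest of the three limits wins.
// Register usage is deliberately left out of this sketch.
int resident_blocks(const SmLimits& sm, int tile) {
    assert(tile > 0);
    const int threads_per_block = tile * tile;
    // Two float tiles per block: one from each input matrix.
    const int smem_per_block = 2 * tile * tile * static_cast<int>(sizeof(float));
    const int by_threads = sm.max_threads / threads_per_block;
    const int by_smem = sm.shared_mem_bytes / smem_per_block;
    return std::min({by_threads, by_smem, sm.max_blocks});
}

// Resident warps per SM for a given tile width.
int resident_warps(const SmLimits& sm, int tile) {
    const int threads_per_block = tile * tile;
    const int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    return resident_blocks(sm, tile) * warps_per_block;
}
```

Under these assumed limits, a 16 × 16 tile gives 48 resident warps, while 8 × 8 and 32 × 32 both give 32: the small tile runs out of block slots and the large one runs out of threads. For a compiled kernel, the CUDA runtime's cudaOccupancyMaxActiveBlocksPerMultiprocessor reports the real figure, registers included.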

Course tie-in

Applied Parallel Programming framed this cleanly: the machine rewards locality. The project made that concrete: occupancy is a constraint problem to solve, not a dial you max out.
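The locality claim can be sketched as a count of global-memory loads for an N × N matmul, with and without shared-memory tiling. This is a textbook-style estimate under simplifying assumptions (square matrices, N a multiple of the tile width, no caching effects), and the function names are hypothetical rather than taken from the project.

```cpp
#include <cassert>
#include <cstdint>

// Untiled: each of the n*n outputs reads a full row of A and a full column of B.
std::int64_t naive_global_loads(std::int64_t n) {
    return 2 * n * n * n;
}

// Tiled: each block stages tile x tile pieces of A and B in shared memory
// and reuses them across the whole output tile. Assumes n % tile == 0.
std::int64_t tiled_global_loads(std::int64_t n, std::int64_t tile) {
    assert(tile > 0 && n % tile == 0);
    const std::int64_t blocks = (n / tile) * (n / tile);   // one block per output tile
    const std::int64_t phases = n / tile;                  // steps along the shared dimension
    const std::int64_t loads_per_phase = 2 * tile * tile;  // one tile of A plus one tile of B
    return blocks * phases * loads_per_phase;
}
```

The tiled count works out to 2N³ / tile, so global traffic falls in proportion to the tile width. A wider tile saves more traffic but also claims more threads and shared memory per block, which is where the per-SM limits push back.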