Project · GPT-2 Inference Engine
KV-cache shifts the bottleneck in transformer inference
CUDA · FlashAttention-2 · Nsight Compute
While building a from-scratch GPT-2 forward pass, I expected matmul to dominate. Profiling told a different story: attention's memory footprint and redundant KV recomputation were the real tax.
What I measured
Using Nsight Compute on NVIDIA A40 nodes, I compared a naive attention path against FlashAttention-2 with KV-cache enabled. The primary signals were end-to-end step latency and DRAM throughput, not peak TFLOPS.
What surprised me
With KV-cache, each new token reuses the stored K/V tensors instead of recomputing them for the full sequence history. That cut redundant memory writes and improved L2 cache behavior more than any single matmul tweak.
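To make the reuse concrete, here is a minimal pure-Python sketch of a single-head decode loop. It is not the engine's CUDA code: `project`, `attend`, `decode_naive`, and `decode_cached` are hypothetical names, and the "projection" is a stand-in scale factor rather than a learned weight matrix. It only counts how many K/V projections each strategy performs and checks that both produce the same outputs.

```python
import math

def project(x, scale):
    # Stand-in for a learned linear projection (assumption: a plain scale factor).
    return [v * scale for v in x]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over all keys/values so far.
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(d)]

def decode_naive(tokens):
    # Recompute K and V for the whole history at every step.
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        history = tokens[: t + 1]
        keys = [project(x, 0.5) for x in history]
        values = [project(x, 2.0) for x in history]
        kv_projections += 2 * len(history)
        outputs.append(attend(project(tokens[t], 1.0), keys, values))
    return outputs, kv_projections

def decode_cached(tokens):
    # Project K and V once per token and append them to the cache.
    keys, values, outputs, kv_projections = [], [], [], 0
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        kv_projections += 2
        outputs.append(attend(project(x, 1.0), keys, values))
    return outputs, kv_projections

# Eight toy "token embeddings" of width 4.
tokens = [[float(i + j) for j in range(4)] for i in range(8)]
naive_out, naive_count = decode_naive(tokens)
cached_out, cached_count = decode_cached(tokens)
print(naive_count, cached_count)  # 72 16
```

For a sequence of T tokens the naive loop performs T(T+1) K/V projections while the cached loop performs 2T, which is the redundant work (and the redundant writes) that the cache removes.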
What I changed
I paired KV-cache with shared-memory tiling and swizzling to reduce bank conflicts. The lesson: in inference, algorithmic reuse beats micro-optimizing an already memory-bound kernel in isolation.
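The effect of swizzling on bank conflicts can be checked with a small host-side simulation. This sketch assumes a 32x32 tile of 4-byte elements and an XOR swizzle (`col ^ row`); the write-up does not say which swizzle pattern or tile shape the kernel actually uses, so treat both as illustrative, and `bank_plain`, `bank_swizzled`, and `max_conflict` as hypothetical helper names. The one hardware fact it relies on is that NVIDIA shared memory is split into 32 banks of 4-byte words.

```python
NUM_BANKS = 32  # shared memory has 32 banks of 4-byte words
TILE = 32       # assumed tile width, in 4-byte elements

def bank_plain(row, col):
    # Row-major layout: the bank depends only on the column.
    return (row * TILE + col) % NUM_BANKS

def bank_swizzled(row, col):
    # XOR swizzle (assumed pattern): permute columns within each row.
    return (row * TILE + (col ^ row)) % NUM_BANKS

def max_conflict(bank_fn, accesses):
    # Largest number of lanes in one warp access that land in the same bank.
    counts = {}
    for row, col in accesses:
        bank = bank_fn(row, col)
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())

# One warp (32 lanes) reading a single column, then a single row.
column_read = [(lane, 5) for lane in range(32)]
row_read = [(5, lane) for lane in range(32)]

print(max_conflict(bank_plain, column_read))     # 32: every lane hits bank 5
print(max_conflict(bank_swizzled, column_read))  # 1: conflict-free
print(max_conflict(bank_plain, row_read))        # 1
print(max_conflict(bank_swizzled, row_read))     # 1
```

Because XOR with a fixed row index is a permutation of the 32 columns, row reads stay conflict-free while column reads spread across all banks.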