
Project · GPT-2 Inference Engine

KV-cache shifts the bottleneck in transformer inference

CUDA · FlashAttention-2 · Nsight Compute

While building a from-scratch GPT-2 forward pass, I expected the matrix multiplications to dominate. Profiling told a different story: attention's memory footprint and the redundant recomputation of keys and values (K/V) were the real cost.

What I measured

Using Nsight Compute on NVIDIA A40 nodes, I compared a naive attention path against FlashAttention-2 with KV-cache enabled. The primary signals were end-to-end step latency and DRAM throughput, not peak TFLOPS.

What surprised me

With KV-cache, each decoding step computes K and V only for the newest token and reuses the stored K/V tensors for all earlier positions, instead of recomputing them for the full sequence history. That cut redundant memory writes and improved L2 behavior more than any single matmul tweak.
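To make the reuse concrete, here is a minimal single-head sketch in plain Python. It is an illustration, not the project's CUDA code: the function names, the tiny dimensions, and the synthetic weights are all assumptions. It counts how many positions get a K/V projection across a decode loop, with and without a cache, and checks that both paths produce the same attention outputs.

```python
import math

def matvec(W, x):
    # Dense matrix-vector product; W is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over the given positions.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_naive(xs, Wq, Wk, Wv):
    # Hypothetical baseline: re-project K and V for the whole prefix each step.
    outputs, kv_projections = [], 0
    for t in range(len(xs)):
        prefix = xs[: t + 1]
        keys = [matvec(Wk, x) for x in prefix]
        values = [matvec(Wv, x) for x in prefix]
        kv_projections += len(prefix)
        outputs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outputs, kv_projections

def decode_cached(xs, Wq, Wk, Wv):
    # KV-cache: project only the new token and append it to the stored K/V.
    keys, values = [], []
    outputs, kv_projections = [], 0
    for x in xs:
        keys.append(matvec(Wk, x))
        values.append(matvec(Wv, x))
        kv_projections += 1
        outputs.append(attend(matvec(Wq, x), keys, values))
    return outputs, kv_projections

# Synthetic, deterministic inputs (assumed sizes: 6 tokens, width 4).
D, T = 4, 6
xs = [[math.sin(t + j) for j in range(D)] for t in range(T)]
Wq = [[math.cos(i * D + j) for j in range(D)] for i in range(D)]
Wk = [[math.cos(i * D + j + 1) for j in range(D)] for i in range(D)]
Wv = [[math.cos(i * D + j + 2) for j in range(D)] for i in range(D)]

naive_out, naive_count = decode_naive(xs, Wq, Wk, Wv)
cached_out, cached_count = decode_cached(xs, Wq, Wk, Wv)
print("K/V projections, naive vs cached:", naive_count, cached_count)
```

For T tokens the naive loop projects T(T+1)/2 positions while the cached loop projects T, so with 6 tokens that is 21 versus 6. The gap grows quadratically with sequence length, which is why the saving shows up as reduced memory traffic rather than as a faster matmul.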

What I changed

I paired KV-cache with shared-memory tiling and swizzling to reduce bank conflicts. The lesson: in inference, algorithmic reuse beats micro-optimizing, in isolation, a kernel that is already memory-bound.
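The effect of swizzling on bank conflicts can be checked with simple index arithmetic. The sketch below is a Python model, not the kernel itself, and it assumes a 32x32 tile of 4-byte elements, 32 shared-memory banks, and an XOR swizzle; the actual tile shape and swizzle pattern used in the project are not stated here, so treat these as illustrative choices.

```python
from collections import Counter

NUM_BANKS = 32  # shared memory is split into 32 banks of 4-byte words
TILE = 32       # assumed tile: 32 x 32 elements, 4 bytes each

def bank_linear(row, col):
    # Plain row-major layout: element (row, col) lives at word row*TILE + col.
    return (row * TILE + col) % NUM_BANKS

def bank_swizzled(row, col):
    # XOR swizzle (assumed pattern): logical column col of a row is stored
    # at physical column col ^ row, which permutes columns within each row.
    return (row * TILE + (col ^ row)) % NUM_BANKS

def conflict_degree(bank_of, coords):
    # Largest number of threads in one warp that map to the same bank.
    # 1 means conflict-free; 32 means the access is fully serialized.
    banks = Counter(bank_of(r, c) for r, c in coords)
    return max(banks.values())

# One warp reads a tile column: thread i reads element (i, 0).
column_read = [(i, 0) for i in range(TILE)]
# One warp reads a tile row: thread i reads element (0, i).
row_read = [(0, i) for i in range(TILE)]

print("column read, linear:  ", conflict_degree(bank_linear, column_read))
print("column read, swizzled:", conflict_degree(bank_swizzled, column_read))
print("row read, linear:     ", conflict_degree(bank_linear, row_read))
print("row read, swizzled:   ", conflict_degree(bank_swizzled, row_read))
```

Under these assumptions a column read in the linear layout sends all 32 threads to the same bank, while the swizzled layout spreads them across 32 distinct banks and leaves row reads conflict-free as well.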