Multi-Query Attention
Attention series: 10 of 11
Multi-Query Attention (Shazeer, 2019) speeds up inference by sharing a single set of Key and Value projections across all query heads. In standard Multi-Head Attention, each head has its own Q, K, and V projections, so the KV cache grows in proportion to the number of heads. MQA keeps multiple independent Query heads (three in this example) but computes K and V just once. Each query head has its own learned weights, but its output dimension must match the shared Key dimension for the dot product to work. This cuts the KV cache by a factor equal to the number of heads. Because autoregressive decoding re-reads the cache at every step, a smaller cache makes decoding significantly faster, with surprisingly little quality loss.
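To make the shapes concrete, here is a minimal NumPy sketch of the idea, not a reference implementation. It assumes three query heads as in the example above, a causal mask, and no output projection or batching. The function names (`multi_query_attention`, `kv_cache_elements`) and all dimensions are hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(x, w_q, w_k, w_v):
    """Illustrative MQA forward pass (hypothetical helper).

    x:   (seq_len, d_model) input tokens
    w_q: (n_heads, d_model, d_head) one query projection per head
    w_k: (d_model, d_head) single key projection shared by all heads
    w_v: (d_model, d_head) single value projection shared by all heads
    """
    seq_len = x.shape[0]
    d_head = w_k.shape[1]

    q = np.einsum("sd,hde->hse", x, w_q)  # (n_heads, seq_len, d_head)
    k = x @ w_k                           # (seq_len, d_head), computed once
    v = x @ w_v                           # (seq_len, d_head), computed once

    # Every query head is scored against the same shared keys.
    scores = np.einsum("hse,te->hst", q, k) / np.sqrt(d_head)

    # Causal mask: position s may not attend to positions t > s.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)          # (n_heads, seq_len, seq_len)
    out = np.einsum("hst,te->hse", weights, v)  # (n_heads, seq_len, d_head)
    return out, k, v


def kv_cache_elements(seq_len, d_head, n_kv_heads):
    # Two cached tensors (K and V), each seq_len x d_head per KV head.
    return 2 * seq_len * d_head * n_kv_heads


rng = np.random.default_rng(0)
n_heads, seq_len, d_model, d_head = 3, 5, 8, 4  # assumed toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(n_heads, d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

out, k, v = multi_query_attention(x, w_q, w_k, w_v)
print(out.shape, k.shape, v.shape)  # (3, 5, 4) (5, 4) (5, 4)

# MHA caches one K/V pair per head; MQA caches one pair in total.
mha_cache = kv_cache_elements(seq_len, d_head, n_kv_heads=n_heads)
mqa_cache = kv_cache_elements(seq_len, d_head, n_kv_heads=1)
print(mha_cache // mqa_cache)  # 3
```

Note that `k` and `v` carry no head axis: that missing axis is the whole saving, and the cache ratio printed at the end equals the number of query heads.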


