Multi-Query Attention

Attention: 10 of 11

Prof. Tom Yeh

May 25, 2026

Multi-Query Attention (MQA; Shazeer, 2019) extends Shared KV attention to the full multi-head setting.

In standard Multi-Head Attention, every head has its own Q, K, and V. During autoregressive decoding, K and V from every previous token must be stored in a KV cache. With H heads, that cache is H times larger than a single head's.
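
To make the "H times larger" claim concrete, here is a back-of-the-envelope sketch in Python. The function name `kv_cache_elements` and the sizes (8 heads, 64 dimensions per head, 1024 tokens) are illustrative assumptions, not values from this post. It counts the cached numbers for a single layer only.

```python
def kv_cache_elements(num_tokens: int, num_kv_heads: int, head_dim: int) -> int:
    """Count cached scalars for one layer: one K row and one V row
    per token, for every head that keeps its own K and V."""
    return 2 * num_tokens * num_kv_heads * head_dim


# Illustrative sizes (assumptions, not taken from the post).
H, head_dim, num_tokens = 8, 64, 1024

single_head = kv_cache_elements(num_tokens, 1, head_dim)
multi_head = kv_cache_elements(num_tokens, H, head_dim)

print(single_head)                # 131072
print(multi_head)                 # 1048576
print(multi_head // single_head)  # 8, i.e. H
```

The ratio depends only on the number of heads that store K and V, not on the sequence length or the head dimension.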

MQA keeps H independent query heads but computes K and V just once. Each head can still ask a different question, but all heads look up the answers in the same shared directory.
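
The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the weight shapes, and the choice to leave out the causal mask and the output projection are all simplifying assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(X, Wq, Wk, Wv):
    """Hypothetical helper.
    X:  (T, d_model)          token embeddings
    Wq: (H, d_model, d_head)  one query projection per head
    Wk: (d_model, d_head)     single shared key projection
    Wv: (d_model, d_head)     single shared value projection
    Returns (H, T, d_head). No causal mask, for brevity.
    """
    Q = np.einsum("td,hdk->htk", X, Wq)      # (H, T, d_head): H different "questions"
    K = X @ Wk                               # (T, d_head): computed once
    V = X @ Wv                               # (T, d_head): computed once
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (H, T, T): K is broadcast across heads
    return softmax(scores) @ V               # (H, T, d_head): V is broadcast across heads


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, d_model = 8 (illustrative)
Wq = rng.normal(size=(3, 8, 4))              # H = 3 heads, d_head = 4 (illustrative)
Wk = rng.normal(size=(8, 4))
Wv = rng.normal(size=(8, 4))

out = multi_query_attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 5, 4)
```

Only `Wq` carries a head dimension. The heads differ in what they ask, not in where they look.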

In this example, Head 1 is the full head: it computes Q, K, and V. Heads 2 and 3 each compute their own Q but reuse Head 1's K and V. Because only one set of K and V is stored per token instead of H, the KV cache shrinks by a factor of H, and Shazeer (2019) reports only minor quality degradation.
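
Here is a rough sketch of what this looks like during decoding, one token at a time. It assumes three heads to match Heads 1 to 3 above. The sizes, the random weights, and the helper name `decode_step` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, d_model, d_head = 3, 8, 4                # three heads; sizes are illustrative assumptions

Wq = rng.normal(size=(H, d_model, d_head))  # every head has its own Q projection
Wk = rng.normal(size=(d_model, d_head))     # Head 1's K projection, reused by Heads 2 and 3
Wv = rng.normal(size=(d_model, d_head))     # Head 1's V projection, reused by Heads 2 and 3

k_cache = np.empty((0, d_head))             # grows by one row per decoded token
v_cache = np.empty((0, d_head))


def decode_step(x, k_cache, v_cache):
    """Hypothetical helper. x: (d_model,) embedding of the newest token."""
    k_cache = np.vstack([k_cache, x @ Wk])    # append ONE K row, not H rows
    v_cache = np.vstack([v_cache, x @ Wv])    # append ONE V row, not H rows
    q = np.einsum("d,hdk->hk", x, Wq)         # (H, d_head): one query per head
    scores = q @ k_cache.T / np.sqrt(d_head)  # (H, t): all heads read the same cache
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)     # softmax over the cached tokens
    return w @ v_cache, k_cache, v_cache      # outputs: (H, d_head)


for _ in range(5):                            # decode 5 tokens with random stand-in embeddings
    x = rng.normal(size=d_model)
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)

print(out.shape)      # (3, 4)
print(k_cache.shape)  # (5, 4)
print(v_cache.shape)  # (5, 4)
```

After five tokens the cache holds five K rows and five V rows in total. With a separate K and V per head, it would hold three times as many.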
