AI by Hand ✍️

Multi-Query Attention

Attention series: 10 of 11

Prof. Tom Yeh
May 25, 2026

Attention Series:

  1. QKV Projection

  2. Attention Computation

  3. Self Attention

  4. Cross Attention

  5. Self Attention vs Cross Attention

  6. Self Attention (Shared KV)

  7. Multi-Head Attention

  8. Fused QKV (Multi-Head)

  9. Single vs Multi-Head Attention

  10. Multi-Query Attention

  11. Grouped-Query Attention

Multi-Query Attention (Shazeer, 2019) speeds up inference by sharing a single set of Key and Value projections across all query heads. In standard Multi-Head Attention, each head has its own Q, K, and V, so the KV cache must store a separate K and V for every head and grows in proportion to the number of heads. MQA keeps multiple independent Query heads (three in this example) but computes K and V just once. Each query head has its own learned weights, but its output dimension must match the shared Key dimension for the dot product to work. This shrinks the KV cache by a factor equal to the number of heads (3× here), which reduces the memory read at each step of autoregressive decoding and makes it significantly faster, with surprisingly little quality loss.
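To make the shapes concrete, here is a minimal NumPy sketch of the idea described above. It is not the diagram from this post: the sizes (`seq_len=4`, `d_model=6`, `d_head=2`), the random weights, and the helper names `multi_query_attention` and `kv_cache_elements` are all assumptions chosen for illustration. It also leaves out the causal mask and the output projection that a real decoder would include.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v):
    """Illustrative MQA forward pass (no mask, no output projection).

    x:   (seq_len, d_model) input tokens
    w_q: (n_heads, d_model, d_head) one Query projection per head
    w_k: (d_model, d_head) single shared Key projection
    w_v: (d_model, d_head) single shared Value projection
    """
    d_head = w_k.shape[1]
    k = x @ w_k                              # (seq_len, d_head), computed once
    v = x @ w_v                              # (seq_len, d_head), computed once
    q = np.einsum("sd,hde->hse", x, w_q)     # (n_heads, seq_len, d_head)
    scores = q @ k.T / np.sqrt(d_head)       # every head scores against the same K
    weights = softmax(scores, axis=-1)       # (n_heads, seq_len, seq_len)
    out = weights @ v                        # every head reads from the same V
    return out, k, v

def kv_cache_elements(seq_len, d_head, kv_heads):
    # One K and one V tensor of shape (seq_len, d_head) per KV head.
    return 2 * seq_len * d_head * kv_heads

rng = np.random.default_rng(0)
seq_len, d_model, d_head, n_heads = 4, 6, 2, 3   # assumed toy sizes

x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(n_heads, d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

out, k, v = multi_query_attention(x, w_q, w_k, w_v)

mha_cache = kv_cache_elements(seq_len, d_head, kv_heads=n_heads)  # 48
mqa_cache = kv_cache_elements(seq_len, d_head, kv_heads=1)        # 16
print(out.shape, k.shape, v.shape)
print(mha_cache, mqa_cache, mha_cache // mqa_cache)
```

With three query heads, the cached K and V hold 16 values instead of 48 for this toy sequence, which is the factor-of-heads reduction. Note that `w_q` still has one slice per head, so the Query side is unchanged from Multi-Head Attention.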
