Multi-Head Attention

Attention: 7 of 11

May 25, 2026



So far you have seen one head asking one question. Multi-Head Attention runs several heads in parallel, each asking a different question. One head might ask "who has a dog?", the midnight question you already know. Another might ask "who is still awake right now?" A third might ask "who lives close enough to hear the same bark?" Each head has its own learned Wq, Wk, and Wv, so each one can learn to attend to a different aspect of the same input.
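To make "each head has its own Wq, Wk, and Wv" concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: the sizes (4 tokens, model width 6, head width 2), the random weights standing in for learned ones, the row-per-token layout, and the use of scaled dot-product attention with a softmax. It is not the exact arithmetic from the diagram.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One head: project X to Q, K, V, then attend (hypothetical helper)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head, n_heads = 4, 6, 2, 3   # assumed toy sizes
X = rng.normal(size=(n_tokens, d_model))          # one row per token

heads = []
for _ in range(n_heads):
    # Each head gets its own projection matrices (random here, learned in practice).
    Wq = rng.normal(size=(d_model, d_head))
    Wk = rng.normal(size=(d_model, d_head))
    Wv = rng.normal(size=(d_model, d_head))
    heads.append(attention_head(X, Wq, Wk, Wv))

for i, (out, weights) in enumerate(heads):
    print(f"head {i}: output shape {out.shape}, weights shape {weights.shape}")
```

All three heads read the same X, but because their projections differ, their attention weights differ too. That is the code-level version of "each head asks a different question."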


In this post's example, three heads run in parallel. Their outputs are concatenated and projected through a final weight matrix Wo to produce the result. Each head contributes one slice of the concatenated vector, and Wo then mixes those slices into the final representation.
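The concatenate-then-project step can be sketched in a few lines of NumPy. The shapes below (4 tokens, three heads of width 2, output width 6) and the random values standing in for real head outputs and a learned Wo are assumptions, chosen only to keep the arithmetic small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_head, n_heads, d_model = 4, 2, 3, 6   # assumed toy sizes

# Stand-ins for the outputs of the three heads, one (n_tokens, d_head) matrix each.
head_outputs = [rng.normal(size=(n_tokens, d_head)) for _ in range(n_heads)]

# Final projection: maps the concatenated width (3 * 2 = 6) to d_model.
Wo = rng.normal(size=(n_heads * d_head, d_model))

# Step 1: place the head outputs side by side, one slice per head.
concat = np.concatenate(head_outputs, axis=-1)    # (4, 6)

# Step 2: project through Wo to get the result.
result = concat @ Wo                              # (4, 6)

# Equivalent view: split Wo into one row block per head; the result is
# the sum of each head's output times its own block of Wo.
Wo_blocks = np.split(Wo, n_heads, axis=0)         # three (2, 6) blocks
result_as_sum = sum(h @ W for h, W in zip(head_outputs, Wo_blocks))

print(concat.shape, result.shape)
print(np.allclose(result, result_as_sum))
```

The second view shows what "a slice per head" means after the projection: each head owns a block of rows in Wo, and the final representation is the sum of the heads' separate contributions.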

Next:
8. Fused QKV (Multi-Head)

AI by Hand ✍️
