Multi-Head Attention
Attention series: 7 of 11
Multi-Head Attention (MHA) lets the model attend to different parts of the input simultaneously. Instead of computing a single set of attention scores, MHA runs multiple attention "heads" in parallel (three in this example), each with its own learned Q, K, and V projections. Because the projections differ, each head can capture different relationships in the data. The head outputs are concatenated and projected through a final weight matrix to produce the result.
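To make the data flow concrete, here is a minimal NumPy sketch of the steps described above: per-head Q, K, and V projections, attention computed in each head, concatenation, and a final output projection. Several details are assumptions for illustration rather than anything stated in this post: the function and variable names (`multi_head_attention`, `w_q`, `w_k`, `w_v`, `w_o`), the toy dimensions, the random weights standing in for learned ones, and the use of scaled dot-product attention (dividing scores by the square root of the head size) inside each head.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Hypothetical helper, not taken from any library.

    x:             (seq_len, d_model) input token vectors
    w_q, w_k, w_v: (num_heads, d_model, d_head) per-head projections
    w_o:           (num_heads * d_head, d_model) final output projection
    """
    # Project the input separately for every head: (num_heads, seq_len, d_head).
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)

    # Assumed scaled dot-product scores per head: (num_heads, seq_len, seq_len).
    d_head = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values in each head: (num_heads, seq_len, d_head).
    heads = weights @ v

    # Concatenate heads along the feature axis: (seq_len, num_heads * d_head).
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)

    # Final projection back to the model dimension.
    return concat @ w_o, weights


# Toy setup: 4 tokens, model width 6, three heads of width 2 each.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 6, 3
d_head = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(num_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(num_heads * d_head, d_model))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)      # (4, 6)
print(weights.shape)  # (3, 4, 4)
```

Each of the three heads produces its own 4×4 attention matrix, which is what allows different heads to weight the same tokens differently. Splitting the model width evenly across heads (`d_head = d_model // num_heads`) is a common choice that keeps the concatenated output the same width as the input, but it is only one option.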


