AI by Hand ✍️

Self Attention

Attention series: 3 of 11

Prof. Tom Yeh
May 25, 2026

Attention Series:

  1. QKV Projection

  2. Attention Computation

  3. Self Attention

  4. Cross Attention

  5. Self Attention vs Cross Attention

  6. Self Attention (Shared KV)

  7. Multi-Head Attention

  8. Fused QKV (Multi-Head)

  9. Single vs Multi-Head Attention

  10. Multi-Query Attention

  11. Grouped-Query Attention

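Self Attention (Vaswani et al., 2017) lets each position in a sequence look at every position, including itself, to decide what's relevant. The input X is projected through three learned weight matrices (Wq, Wk, Wv) to produce Queries (Q), Keys (K), and Values (V). The formulas here use a layout with one column per position. The score matrix S = Kt × Q, where Kt is the transpose of K, holds the dot product of every key with every query, producing a seq × seq attention map in which column j scores all positions against the query from position j. A softmax over each column turns those scores into attention weights A that sum to 1 per column, and the output F = V × A gives, for each position, the corresponding weighted combination of the Value columns.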
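The steps above can be traced in a few lines of plain Python. This is a minimal sketch, not the post's own implementation: it assumes X is stored with one column per position, so the projections are Q = Wq × X, K = Wk × X, and V = Wv × X, and the helper names (`matmul`, `transpose`, `softmax_columns`, `self_attention`) and the toy numbers are made up for illustration. It follows the text in leaving the scores unscaled; Vaswani et al. additionally divide the scores by the square root of the key dimension before the softmax.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    """Swap rows and columns."""
    return [list(col) for col in zip(*M)]

def softmax_columns(S):
    """Apply softmax to each column so that every column sums to 1."""
    cols = []
    for col in zip(*S):
        m = max(col)  # subtract the column max for numerical stability
        exps = [math.exp(s - m) for s in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)  # back to a list of rows

def self_attention(X, Wq, Wk, Wv):
    """Assumed layout: X is features x seq, one column per position."""
    Q = matmul(Wq, X)            # queries, one column per position
    K = matmul(Wk, X)            # keys
    V = matmul(Wv, X)            # values
    S = matmul(transpose(K), Q)  # S = Kt x Q, seq x seq; S[i][j] = k_i . q_j
    A = softmax_columns(S)       # column j holds the weights for query j
    F = matmul(V, A)             # F = V x A, weighted combination of values
    return S, A, F

# Toy input (illustrative values): 2 features, 3 positions.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
Wq = [[1.0, 0.0], [0.0, 1.0]]  # identity, so Q equals X
Wk = [[1.0, 0.0], [0.0, 1.0]]  # identity, so K equals X
Wv = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two features

S, A, F = self_attention(X, Wq, Wk, Wv)
for name, M in (("S", S), ("A", A), ("F", F)):
    print(name)
    for row in M:
        print("  ", [round(v, 3) for v in row])
```

Because both Wq and Wk are the identity in this toy setup, S is simply the table of dot products between the input columns, which makes the seq × seq shape and the per-column weighting easy to check by hand.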
