Self Attention
Attention series: 3 of 11
Self Attention (Vaswani et al., 2017) lets each position in a sequence look at every other position to decide what's relevant. The input X, with one column per position, is projected through three learned weight matrices (Wq, Wk, Wv) to produce Queries (Q = Wq × X), Keys (K = Wk × X), and Values (V = Wv × X). The score matrix S = Kᵀ × Q compares every key with every query, producing a seq × seq attention map in which column j holds the scores for the query at position j. Softmax is applied to each column, so the attention weights A in a column sum to 1, and the output F = V × A gives, for each position, a weighted combination of the Values.
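The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation. It assumes each column of X is one position, which is the layout that makes S = Kt × Q (K transposed times Q) come out as seq × seq. The function names, the sizes (d_model, d_k, d_v, seq), and the random weights are hypothetical placeholders chosen for the example.

```python
import numpy as np


def softmax_columns(S):
    """Softmax over each column, so every column of the result sums to 1."""
    # Subtracting the column max does not change the result but avoids overflow.
    shifted = S - S.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with positions stored as columns of X."""
    Q = Wq @ X               # queries, shape (d_k, seq)
    K = Wk @ X               # keys,    shape (d_k, seq)
    V = Wv @ X               # values,  shape (d_v, seq)
    S = K.T @ Q              # scores,  shape (seq, seq); S[i, j] = key_i . query_j
    A = softmax_columns(S)   # attention weights; column j belongs to query j
    F = V @ A                # output,  shape (d_v, seq)
    return F, A


# Illustrative sizes and random weights (assumptions, not from the article).
rng = np.random.default_rng(0)
d_model, d_k, d_v, seq = 8, 4, 4, 5
X = rng.normal(size=(d_model, seq))
Wq = rng.normal(size=(d_k, d_model))
Wk = rng.normal(size=(d_k, d_model))
Wv = rng.normal(size=(d_v, d_model))

F, A = self_attention(X, Wq, Wk, Wv)
print(A.shape, F.shape)  # (5, 5) (4, 5)
```

The sketch follows the description above and applies softmax directly to S. Vaswani et al. additionally divide the scores by the square root of the key dimension before the softmax; adding that here would mean replacing `S` with `S / np.sqrt(K.shape[0])`.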


