Attention Computation
Attention series: 2 of 11
After the QKV projection, the attention step computes how much each position attends to every position in the sequence. First, K is transposed so its dimensions align for the dot product: S = Kᵀ × Q produces a seq × seq score matrix. After scaling and softmax (A = softmax(S/√d)), the attention weights A select from the values: F = V × A.
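The three formulas above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes Q and K are stored with one column per position (shape d × seq) and V as d_v × seq, which is the layout implied by S = Kᵀ × Q and F = V × A. The function names softmax_columns and attention, and the sizes used, are made up for this example.

```python
import numpy as np

def softmax_columns(S):
    """Softmax applied independently to each column of S."""
    # Subtracting the per-column max does not change the result
    # but avoids overflow in exp.
    shifted = S - S.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def attention(Q, K, V):
    """Q, K: (d, seq); V: (d_v, seq). Returns F: (d_v, seq) and A: (seq, seq)."""
    d = K.shape[0]                        # key dimension
    S = K.T @ Q                           # (seq, seq), S[i, j] = k_i . q_j
    A = softmax_columns(S / np.sqrt(d))   # each column is a distribution
    F = V @ A                             # (d_v, seq), weighted mix of value columns
    return F, A

# Illustrative sizes (assumptions, not taken from the article).
rng = np.random.default_rng(0)
d, d_v, seq = 8, 4, 5
Q = rng.standard_normal((d, seq))
K = rng.standard_normal((d, seq))
V = rng.standard_normal((d_v, seq))

F, A = attention(Q, K, V)
print(A.shape, F.shape)   # score/weight matrix is seq x seq, output is d_v x seq
print(A.sum(axis=0))      # column sums of the attention weights
```

Column j of A holds the weights that position j places on every position, so column j of F is the corresponding weighted combination of the columns of V.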
The raw scores are divided by √d, where d is the key dimension, to keep the dot products from growing too large as d increases. Without this scaling, softmax would saturate and produce near-one-hot outputs with very small gradients. Softmax is then applied to each column, converting the scores into a probability distribution over positions, so each column of A sums to 1.
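The effect of the √d scaling can be checked numerically. The sketch below is only an illustration under stated assumptions: the entries of Q and K are drawn from a standard normal distribution, d = 512 and seq = 16 are arbitrary choices, and the "peak" statistic (the largest weight in each column, averaged over columns) is a made-up measure of how close A is to one-hot.

```python
import numpy as np

def softmax_columns(S):
    """Softmax applied independently to each column of S."""
    shifted = S - S.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
d, seq = 512, 16                      # arbitrary sizes for the demonstration
Q = rng.standard_normal((d, seq))
K = rng.standard_normal((d, seq))
S = K.T @ Q                           # raw scores, spread grows with d

A_unscaled = softmax_columns(S)               # no scaling
A_scaled = softmax_columns(S / np.sqrt(d))    # scaled as in the text

# Average of the largest weight per column: 1.0 would mean fully one-hot.
peak_unscaled = A_unscaled.max(axis=0).mean()
peak_scaled = A_scaled.max(axis=0).mean()
print(peak_unscaled, peak_scaled)
```

With inputs like these, the unscaled peak is typically much closer to 1 than the scaled one, which matches the saturation described above.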
This is the step that makes self-attention O(n²) in the sequence length n: S and A are both seq × seq regardless of the key or value dimensions.
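A small check of that claim, again assuming the column-per-position layout (Q and K of shape d × seq). The helper score_matrix_entries is hypothetical and exists only for this illustration; it counts the entries of S for a given sequence length and key dimension.

```python
import numpy as np

def score_matrix_entries(seq, d):
    """Number of entries in S = K^T Q when Q and K have shape (d, seq)."""
    Q = np.zeros((d, seq))
    K = np.zeros((d, seq))
    S = K.T @ Q           # shape (seq, seq), independent of d
    return S.size

for seq in (128, 256, 512):
    # Same count for both key dimensions; it depends only on seq.
    print(seq, score_matrix_entries(seq, d=16), score_matrix_entries(seq, d=64))
```

Doubling the sequence length quadruples the number of scores, while changing d leaves the count unchanged.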
Note that this part of the calculation does not involve any learned parameters. It is just matrix multiplications and a softmax function. The only "knobs" to turn are the sequence length and the key/value dimensions, which affect the shapes of the matrices but not the fundamental operations.
Next:
3. Self Attention


