Attention Computation

Attention: 2 of 11

May 25, 2026


You have your query ("who has a dog?"), every neighbor's key, and every neighbor's address. Now you score each neighbor against your query.


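S = Kᵀ × Q computes one score per (neighbor, querier) pair: with tokens stored as the columns of K and Q, entry (i, j) of S is the dot product of neighbor i's key with querier j's query. The neighbor with a dog scores high. The neighbor with a cat scores near zero. S is a seq × seq matrix: every token scores every token, itself included, all at once.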
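As a minimal sketch of this scoring step, the snippet below builds S by hand in plain Python. The two features ("has a dog", "has a cat"), the three neighbors, and the helper names `transpose` and `matmul` are illustrative assumptions, not part of the original walkthrough.

```python
def transpose(M):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# Columns are tokens: K and Q are both d x seq (here d = 2, seq = 3).
# Assumed toy features: row 0 = "has a dog", row 1 = "has a cat".
K = [[1.0, 0.0, 0.0],   # neighbor 0 has a dog
     [0.0, 1.0, 0.0]]   # neighbor 1 has a cat; neighbor 2 has neither
Q = [[1.0, 0.0, 1.0],   # queriers 0 and 2 ask about dogs
     [0.0, 1.0, 0.0]]   # querier 1 asks about cats

# S is seq x seq: S[i][j] = (key of neighbor i) . (query of querier j)
S = matmul(transpose(K), Q)
print(S)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

Reading down column 0 (the first dog query), the dog owner scores 1 and the other two neighbors score 0.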

The scores are divided by √d, where d is the key dimension, before softmax. Without this, the typical size of a dot product grows with the key dimension (roughly as √d when the entries of queries and keys have comparable scale), and softmax saturates: you'd fixate on one neighbor and ignore everyone else.
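The effect of the key dimension can be checked numerically. The sketch below assumes query and key entries are independent standard-normal samples (an assumption made only for this illustration) and estimates the spread of their dot product with and without the √d scaling. `dot_std` is a hypothetical helper name.

```python
import math
import random

def dot_std(d, trials=1000, seed=0):
    """Sample standard deviation of q . k for d-dimensional q and k
    whose entries are independent standard-normal draws (assumed)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / trials)

for d in (4, 64, 256):
    raw = dot_std(d)
    # Columns: d, spread of raw scores, spread after dividing by sqrt(d)
    print(d, round(raw, 2), round(raw / math.sqrt(d), 2))
```

Under that assumption the raw spread grows with d while the scaled spread stays close to 1, which keeps the inputs to softmax in a range where more than one neighbor can receive weight.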

A = softmax(S/√d) converts scores into attention weights. Each column of A sums to 1: your attention is a probability distribution over all neighbors. The dog owner gets most of the weight.
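A small sketch of the column-wise softmax, using only the standard library. The score values and the name `softmax_columns` are made up for illustration; subtracting the column maximum before exponentiating is a common numerical-stability trick and is not mentioned in the text above.

```python
import math

def softmax_columns(S, d):
    """Scale scores by sqrt(d), then apply softmax to each column independently."""
    scale = math.sqrt(d)
    rows, cols = len(S), len(S[0])
    A = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        col = [S[i][j] / scale for i in range(rows)]
        m = max(col)                            # stability: shift so the largest exponent is 0
        exps = [math.exp(x - m) for x in col]
        total = sum(exps)
        for i in range(rows):
            A[i][j] = exps[i] / total
    return A

# Assumed toy scores: rows are neighbors, columns are queriers.
S = [[4.0, 0.0, 4.0],
     [0.0, 4.0, 0.0],
     [0.0, 0.0, 0.0]]
A = softmax_columns(S, d=4)

for j in range(3):
    column = [A[i][j] for i in range(3)]
    print([round(w, 3) for w in column], "sum =", round(sum(column), 6))
```

Each printed column sums to 1, and in column 0 the highest-scoring neighbor receives most of the weight.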

F = V × A takes a weighted combination of all addresses. The dog owner's address dominates the output. In practice this blend is the whole point: the model does not pick one token but mixes information from all of them, weighted by relevance.
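To make the blend concrete, here is a hedged sketch of F = V × A with made-up numbers. The two-dimensional "addresses" in V and the weights in A are assumptions chosen so the arithmetic is easy to follow, and `matmul` is an illustrative helper.

```python
def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    Bt = [list(col) for col in zip(*B)]
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# V is d_v x seq: column i is neighbor i's "address" (assumed toy coordinates).
V = [[10.0, 20.0, 30.0],
     [ 1.0,  2.0,  3.0]]

# A is seq x seq and each column sums to 1 (assumed attention weights).
A = [[0.90, 0.05, 0.90],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.05]]

# F is d_v x seq: column j is the weighted mix of addresses for querier j.
F = matmul(V, A)
for j in range(3):
    print([round(F[i][j], 2) for i in range(2)])
```

Column 0 of F comes out near [11.5, 1.15]: close to neighbor 0's address [10, 1] but nudged by the other two, which is the mixing behavior described above.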

This step has no learned parameters. It is just matrix multiplications and a softmax. It is also what makes self-attention O(n²) in the sequence length n, in both time and memory: S and A are both seq × seq regardless of key or value dimensions.
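Putting the three operations together, the following sketch shows that the whole step is a pure function of Q, K, and V, and that the weight matrix stays seq × seq whatever the key dimension is. The function name `attention`, the helper `ones`, and the all-ones inputs are assumptions for illustration only.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def attention(Q, K, V):
    """F = V x softmax(K^T Q / sqrt(d)), with tokens as columns.
    No weights are stored or learned inside this function."""
    d = len(K)                                  # key dimension = number of rows of K
    S = matmul(transpose(K), Q)                 # seq x seq scores
    A_cols = []
    for col in transpose(S):                    # one column per querier
        scaled = [s / math.sqrt(d) for s in col]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        A_cols.append([e / total for e in exps])
    A = transpose(A_cols)                       # seq x seq weights, columns sum to 1
    return matmul(V, A), A

def ones(rows, cols):
    return [[1.0] * cols for _ in range(rows)]

seq = 5
for d in (2, 16):
    F, A = attention(ones(d, seq), ones(d, seq), ones(3, seq))
    # A is 5 x 5 for both values of d; F is d_v x seq = 3 x 5.
    print(d, (len(A), len(A[0])), (len(F), len(F[0])))
```

Doubling seq would quadruple the number of entries in S and A, while changing d only changes the cost of each individual dot product.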

