Attention
💡Foundation AI Seminar Series
This week’s Foundation seminar focused on rebuilding intuition for attention, starting from the landmark paper “Attention Is All You Need,” which introduced the Transformer and reshaped how modern language models are built.
I began by framing tokens as vectors—units of information—and explained attention as a way to produce new tokens by combining existing ones with different weights. Each output token is simply a weighted mixture of earlier tokens, not a mysterious operation.
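This "weighted mixture" view can be made concrete in a few lines of NumPy. The token vectors and weights below are made-up toy values, not anything from the seminar:

```python
import numpy as np

# Three toy token vectors (each token is just a vector of numbers).
tokens = np.array([
    [1.0, 0.0],   # token 1
    [0.0, 1.0],   # token 2
    [1.0, 1.0],   # token 3
])

# Attention weights for one output token; they sum to 1.
weights = np.array([0.5, 0.3, 0.2])

# The output token is simply the weighted mixture of the inputs.
output = weights @ tokens
print(output)  # [0.7 0.5]
```

Different weight vectors produce different output tokens from the same inputs; attention is the machinery that computes those weights.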
We then walked through scaled dot-product attention step by step. Tokens are projected into queries, keys, and values; dot products compare what a token is looking for with what others can provide; dividing the scores by the square root of the key dimension keeps their magnitudes stable; and softmax turns the scores into probabilities. I emphasized that matrix multiplication is just many dot products happening in parallel.
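Those steps can be sketched directly in NumPy. The shapes and random inputs below are illustrative assumptions; in a real model the queries, keys, and values come from learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    """Turn scores into probabilities (shifted by the max for numerical stability)."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # One matrix multiply computes all query-key dot products in parallel.
    scores = Q @ K.T / np.sqrt(d_k)     # scaling keeps the scores stable
    weights = softmax(scores, axis=-1)  # scores -> probabilities
    return weights @ V, weights         # each output is a weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 queries, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (4, 8): one mixed output per query
print(w.sum(axis=-1))  # each row of weights sums to 1
```

Each row of `w` is exactly the kind of weight vector from the mixture picture above: non-negative numbers summing to one.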
From there, we extended the idea to multi-head attention. Instead of one comparison space, multiple heads run in parallel, each learning a different way to relate tokens. Their outputs are concatenated and projected back to the original dimension so the next layer can proceed cleanly.
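A minimal sketch of that parallel structure, again with assumed toy dimensions and random weight matrices standing in for learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    w = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return w @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        # Each head gets its own slice of the projections:
        # a separate comparison space for relating tokens.
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
        heads.append(attention(Q, K, V))
    concat = np.concatenate(heads, axis=-1)  # back to (n, d_model)
    return concat @ Wo                       # final projection

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (5, 16): same shape as the input, ready for the next layer
```

The concatenate-then-project step is what lets the next layer treat the combined heads as one ordinary token representation.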
The main takeaway was simple: attention looks intimidating on paper, but once you unpack it, it’s a clear and systematic way to mix information across tokens—the core idea that makes Transformers work.
Recording & Excel Workbook
The full recording and the associated Excel workbook are available to AI by Hand Academy members. You can become a member via a paid Substack subscription.


