More attention for… better attention ✍️
Following last week’s deep dive into how transformers can ignore tokens, this week’s early access zooms out to the bigger picture: how attention itself has evolved.
I’ve created four new worksheets breaking down the math behind:
Multi-Head Attention (MHA) – the original transformer workhorse.
Multi-Query Attention (MQA) – shares keys and values across heads for speed.
Grouped-Query Attention (GQA) – the middle ground between MHA and MQA, used in GPT-OSS and increasingly in other frontier models (see the first sketch after this list).
Mixture-of-Experts Attention (MoE-A) – routes attention to specialized experts for scalability (see the second sketch below).
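If you prefer to see the relationship in code rather than equations, here is a minimal PyTorch sketch (not taken from the worksheets) in which a single `num_kv_heads` parameter covers the first three variants: setting it equal to the number of query heads gives MHA, setting it to 1 gives MQA, and anything in between is GQA. The class and parameter names are illustrative assumptions, not the worksheets' notation.

```python
# Minimal sketch: MHA, MQA, and GQA differ only in the number of key/value heads.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads          # query heads
        self.num_kv_heads = num_kv_heads    # shared key/value heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Project and split into heads: (batch, heads, time, head_dim)
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one K/V head.
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)  # softmax(QK^T / sqrt(d)) V
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)


# num_kv_heads == num_heads -> MHA; num_kv_heads == 1 -> MQA; in between -> GQA
x = torch.randn(2, 16, 512)
mha = GroupedQueryAttention(512, num_heads=8, num_kv_heads=8)
mqa = GroupedQueryAttention(512, num_heads=8, num_kv_heads=1)
gqa = GroupedQueryAttention(512, num_heads=8, num_kv_heads=2)
print(mha(x).shape, mqa(x).shape, gqa(x).shape)  # all (2, 16, 512)
```

The practical payoff is the KV cache: fewer key/value heads means fewer tensors to store and move at inference time, which is exactly why MQA and GQA trade a little expressiveness for speed.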
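MoE-style attention has several formulations in the literature; the second sketch below shows one common high-level pattern, assumed here for illustration rather than taken from the worksheet: a learned router scores each token over a small set of attention experts, keeps the top-k, and mixes their outputs.

```python
# High-level sketch of one mixture-of-experts attention formulation.
# Expert count, top_k, and the dense evaluate-all-experts loop are
# simplifying assumptions; real implementations dispatch sparsely.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # per-token expert scores
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Router: keep the top-k experts per token, renormalize their weights.
        scores = self.router(x)                         # (b, t, E)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (b, t, k)
        weights = F.softmax(weights, dim=-1)
        # Run every expert densely for clarity, then gather each token's picks.
        expert_out = torch.stack(
            [attn(x, x, x, need_weights=False)[0] for attn in self.experts], dim=2
        )                                               # (b, t, E, d)
        picked = torch.gather(
            expert_out, 2, idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))
        )                                               # (b, t, k, d)
        return (weights.unsqueeze(-1) * picked).sum(dim=2)  # (b, t, d)


x = torch.randn(2, 16, 512)
print(MoEAttention(512, num_heads=8)(x).shape)  # (2, 16, 512)
```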
📌 Why this matters:
GQA showing up in GPT-OSS is yet another sign of its wide adoption, mirroring the earlier moves to MQA in PaLM and to GQA in the larger LLaMA models. These are no longer niche optimizations; they’re becoming standard in high-performance architectures.
⬇️ Download the worksheets below (for Frontier Subscribers only)