AI by Hand ✍️

MHA, MQA, GQA, MoE-A: More Attention!

Frontier Model Math by hand ✍️

Prof. Tom Yeh
Aug 11, 2025

More attention for… better attention ✍️

Following last week’s deep dive into how transformers can ignore tokens, in this week’s early access we zoom out to the bigger picture: how attention itself has evolved.

I’ve created four new worksheets breaking down the math behind:

  1. Multi-Head Attention (MHA) – the original transformer workhorse.

  2. Multi-Query Attention (MQA) – shares keys and values across heads for speed.

  3. Grouped-Query Attention (GQA) – the middle ground between MHA and MQA, used in GPT-OSS and increasingly in other frontier models (all three variants are sketched in code after this list).

  4. Mixture-of-Experts Attention (MoE-A) – routes attention to specialized experts for scalability (see the second sketch below).
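
To make the first three variants concrete before you open the worksheets, here is a minimal NumPy sketch of attention with a configurable number of key/value heads. It is not taken from the worksheets; the function and variable names are my own and the shapes are toy-sized. Giving K and V as many heads as Q yields MHA, a single shared K/V head yields MQA, and anything in between is GQA.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """
    Q:    (num_heads, seq_len, d_head)     one query projection per head
    K, V: (num_kv_heads, seq_len, d_head)  shared key/value heads

    num_kv_heads == num_heads -> MHA (every head has its own K/V)
    num_kv_heads == 1         -> MQA (all heads share a single K/V)
    in between                -> GQA (heads share K/V within groups)
    """
    num_heads, seq_len, d_head = Q.shape
    num_kv_heads = K.shape[0]
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads         # query heads per K/V head
    outputs = []
    for h in range(num_heads):
        kv = h // group_size                       # which K/V head this query head uses
        scores = Q[h] @ K[kv].T / np.sqrt(d_head)  # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)
        outputs.append(weights @ V[kv])            # (seq_len, d_head)
    return np.stack(outputs)                       # (num_heads, seq_len, d_head)

# Toy example: 8 query heads sharing 2 key/value heads (GQA with groups of 4).
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 4, 16))
K = rng.normal(size=(2, 4, 16))
V = rng.normal(size=(2, 4, 16))
print(grouped_query_attention(Q, K, V).shape)      # (8, 4, 16)
```

The worksheets walk through this same arithmetic by hand; the practical payoff of MQA and GQA is that K and V shrink from num_heads sets to num_kv_heads sets, which is what cuts the KV-cache size at inference time.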
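
Mixture-of-Experts attention can be set up in more than one way, and the worksheet spells out one version step by step. Purely as a hypothetical illustration of the routing idea (the router, the top-k gating, and every name and shape here are my assumptions, not the worksheet's), this sketch lets a learned router send each token to its top-k attention "experts" and mix their outputs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_attention(x, experts, W_router, top_k=2):
    """
    x:        (seq_len, d_model) token representations
    experts:  list of callables mapping (seq_len, d_model) -> (seq_len, d_model);
              in a real model each would be a full attention block
    W_router: (d_model, num_experts) router weights (hypothetical parameter)

    Each token is routed to its top_k experts, and their outputs are mixed
    with the renormalized router probabilities.
    """
    probs = softmax(x @ W_router, axis=-1)          # (seq_len, num_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]    # chosen expert ids per token
    expert_outs = [e(x) for e in experts]           # run every expert (dense, for clarity)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = probs[t, top[t]]
        gate = gate / gate.sum()                    # renormalize over the chosen experts
        for g, e_idx in zip(gate, top[t]):
            out[t] += g * expert_outs[e_idx][t]
    return out

# Toy demo: 4 "experts", each a random linear map standing in for an attention block.
rng = np.random.default_rng(1)
seq_len, d_model, num_experts = 4, 16, 4
mats = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
experts = [lambda x, M=M: x @ M for M in mats]
W_router = rng.normal(size=(d_model, num_experts))
x = rng.normal(size=(seq_len, d_model))
print(moe_attention(x, experts, W_router).shape)    # (4, 16)
```

In production MoE layers the selected experts are usually the only ones actually computed; running every expert densely here just keeps the sketch short.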

📌 Why this matters:
GQA showing up in GPT-OSS is yet another sign of its wide adoption—mirroring trends we’ve seen in PaLM, LLaMA, and other large-scale models. These are no longer niche optimizations; they’re becoming standard in high-performance architectures.

⬇️ Download the worksheets below (for Frontier Subscribers only)

This post is for paid subscribers
