AI by Hand ✍️

RMS, Group, Layer, Batch Norm, Tensor Parallelism

Frontier Model Math by hand ✍️

Prof. Tom Yeh
Oct 07, 2025 ∙ Paid

(P.S. This issue is written for advanced AI engineers and researchers. It is part of the premium Frontier subscription. If you are a beginner, please check out my free lectures, walkthroughs, and Excel exercises. I also share regular announcements of free learning opportunities.)

Last week, we explored RoPE — a 2021 innovation that quickly became the default for positional encoding. This week, we turn to another technique that rose to prominence around the same time: RMSNorm.

And because we covered tensor parallelism a few weeks back, I can now show you RMSNorm’s key advantage — in a multi-GPU setting, it needs only one all-reduce instead of two, a simple change that delivers a clear efficiency gain.
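
To make "dropping the mean" concrete, here is a minimal PyTorch sketch of the two normalizations (an illustration of the standard formulas, not the worksheet solutions):

```python
import torch

def layer_norm(x, weight, bias, eps=1e-5):
    # LayerNorm: center and scale over the feature (last) dimension.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: no mean subtraction and no bias; scale by the root mean square.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight
```

Because RMSNorm never needs the mean of x, the only statistic that has to be shared across devices is the sum of squares, which is where the single all-reduce comes from.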

Today, RoPE and RMSNorm stand side by side as two of the most influential transformer upgrades of the past few years. They’re no longer just clever tweaks — they’re foundational components of frontier models like LLaMA, Qwen3, and DeepSeek.

Worksheets

For this week’s Frontier issue, I created eight new worksheets as follows:

  1. BatchNorm – The classic approach that started it all, now mostly seen in CNNs.

  2. LayerNorm – The original choice for transformers, normalizing across features.

  3. RMSNorm – A simplified variant that drops the mean, now standard in many frontier models.

  4. GroupNorm – A middle ground between BatchNorm and LayerNorm, useful in specific architectures. (The axes each of these norms pools over are sketched right after this list.)

  5. LayerNorm + Tensor Parallelism – Shows why LayerNorm typically requires two all-reduce operations (to compute mean and variance) when sharded across devices — a key scalability bottleneck.

  6. RMSNorm + Tensor Parallelism – Illustrates how dropping the mean reduces synchronization to a single all-reduce, making RMSNorm significantly more efficient at scale (see the all-reduce sketch after this list).

  7. Post-LayerNorm (Original Transformer) – The design used in the original 2017 architecture.

  8. Pre-RMSNorm (Modern Transformer) – The modern variant used in LLaMA, DeepSeek, and beyond (see the block sketch after this list).
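
For worksheets 1 to 4, the core difference is simply which axes the statistics are pooled over. A rough sketch, assuming a hypothetical 4D activation of shape (batch, channels, height, width):

```python
import torch

x = torch.randn(8, 16, 32, 32)  # hypothetical (batch, channels, height, width)

# BatchNorm: one statistic per channel, pooled over batch and spatial dims.
bn_mean = x.mean(dim=(0, 2, 3))    # shape (16,)

# LayerNorm: one statistic per sample, pooled over all feature dims.
ln_mean = x.mean(dim=(1, 2, 3))    # shape (8,)

# GroupNorm: channels split into groups; one statistic per (sample, group).
groups = 4
gn_mean = x.view(8, groups, 16 // groups, 32, 32).mean(dim=(2, 3, 4))  # shape (8, 4)

# RMSNorm uses the LayerNorm axes but keeps only the mean of squares.
rms = x.pow(2).mean(dim=(1, 2, 3)).sqrt()  # shape (8,)
```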
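
Worksheets 5 and 6 are the heart of this issue. As a rough sketch of the synchronization pattern, assuming the hidden dimension is sharded across ranks and using torch.distributed collectives (illustrative only, not the worksheet solution):

```python
import torch
import torch.distributed as dist

def layer_norm_stats_sharded(x_shard, hidden_size):
    # Each rank holds only a slice of the features, so the global
    # statistics have to be assembled with collectives.
    # All-reduce #1: the global mean.
    s = x_shard.sum(dim=-1, keepdim=True)
    dist.all_reduce(s, op=dist.ReduceOp.SUM)
    mean = s / hidden_size
    # All-reduce #2: the global variance of the centered values.
    sq = (x_shard - mean).pow(2).sum(dim=-1, keepdim=True)
    dist.all_reduce(sq, op=dist.ReduceOp.SUM)
    var = sq / hidden_size
    return mean, var

def rms_norm_stats_sharded(x_shard, hidden_size, eps=1e-6):
    # RMSNorm never needs the mean, so a single all-reduce of the
    # sum of squares is enough.
    sq = x_shard.pow(2).sum(dim=-1, keepdim=True)
    dist.all_reduce(sq, op=dist.ReduceOp.SUM)
    return torch.sqrt(sq / hidden_size + eps)
```

One fewer blocking collective per normalization, repeated across every layer of the model, is where the efficiency gain comes from.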
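
For worksheets 7 and 8, the structural difference between the two designs fits in a few lines (a sketch with placeholder sublayer and norm functions):

```python
def post_norm_block(x, sublayer, norm):
    # Original 2017 Transformer: add the residual first, then normalize the sum.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Modern LLaMA-style variant: normalize the sublayer input and keep the
    # residual path untouched.
    return x + sublayer(norm(x))
```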

🚀 Next Week: We’ll bring everything together for a full, step-by-step breakdown of Qwen3, one of the most advanced open-source transformers today. All the pieces I’ve covered in the Frontier series will finally click into place.

⬇️ Download the worksheets below (for Frontier Subscribers only)
