AI by Hand ✍️

RoPE vs PE in QKV Self-Attention

Frontier Model Math by hand ✍️

Prof. Tom Yeh
Sep 30, 2025

(P.S. This issue is written for advanced AI engineers and researchers. It is part of the premium Frontier subscription. If you are a beginner, please check out my free lectures, walkthroughs, and Excel exercises. I also share regular announcements of free learning opportunities.)

Source: DeepSeek V3.2 Technical Report (Sep. 2025)

Yesterday, DeepSeek released version 3.2, formally introducing DSA (DeepSeek Sparse Attention) and releasing both the Python reference implementation and open pre-trained weights. I’ve been busy building an Excel reference implementation to help a partner company grasp the math behind DSA and begin constructing its own version.

For this week’s Frontier issue, I intend to focus on RoPE (Rotary Positional Embedding). The timing of DeepSeek’s V3.2 release is perfect: it reinforces why understanding RoPE is no longer optional. RoPE has quickly become a de facto standard in modern transformer architectures since it was published in April 2021, and its influence is only growing.

I created four new sets of worksheets to illustrate the following (a quick code sketch contrasting PE and RoPE follows the list):

  1. Sine & Cosine Positional Encoding Patterns

  2. Positional Encoding (PE) vs. Rotary Positional Embedding (RoPE)

  3. QKV Self-Attention + Positional Encoding (PE)

  4. QKV Self-Attention + Rotary Positional Embedding (RoPE)
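
To make the contrast concrete before you open the worksheets, here is a minimal NumPy sketch of the two mechanisms: classic sine/cosine positional encoding added to the embeddings before the Q/K projections, versus RoPE rotating Q and K after projection. This is my own illustration, not the worksheet or DeepSeek code; the function names (`sinusoidal_pe`, `apply_rope`), the interleaved pairing of channels, and the toy dimensions are assumptions, and the sketch assumes an even model dimension and a single attention head.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sine/cosine positional encoding (assumes even d_model).

    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even channels
    pe[:, 1::2] = np.cos(angles)                    # odd channels
    return pe

def apply_rope(x: np.ndarray) -> np.ndarray:
    """Rotate each (even, odd) channel pair of x by a position-dependent angle."""
    seq_len, d_model = x.shape
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    theta = pos / (10000 ** (2 * i / d_model))      # same frequencies as above
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin       # 2D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Toy single-head comparison: where each scheme enters the QKV pipeline.
np.random.seed(0)
seq_len, d = 4, 8
X = np.random.randn(seq_len, d)                     # token embeddings
Wq, Wk = np.random.randn(d, d), np.random.randn(d, d)

# PE: add position vectors to the input, then project to Q and K.
Xp = X + sinusoidal_pe(seq_len, d)
scores_pe = (Xp @ Wq) @ (Xp @ Wk).T / np.sqrt(d)

# RoPE: project first, then rotate Q and K before computing scores.
Q, K = apply_rope(X @ Wq), apply_rope(X @ Wk)
scores_rope = Q @ K.T / np.sqrt(d)
print(scores_pe.shape, scores_rope.shape)           # (4, 4) (4, 4)
```

The key design difference the worksheets walk through: PE is added once to the input embeddings, while RoPE is applied to Q and K inside every attention layer, which makes the dot product between positions m and n depend only on their relative offset m − n.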

Once I finish explaining RoPE, only a few more building blocks remain — like RMSNorm and SwiGLU — before I’ll have unpacked all the key pieces needed to do a full architecture breakdown. My plan is to work toward a complete Qwen3 breakdown in the coming weeks.

⬇️ Download the worksheets below (for Frontier Subscribers only)
