AI by Hand ✍️

Gated Attention ~ NeurIPS 2025 Best Paper

Frontier AI Seminar Series

Prof. Tom Yeh
Dec 18, 2025

Earlier this week, I delivered a Frontier AI Seminar diving deep into Gated Attention, winner of the NeurIPS 2025 Best Paper Award.

When a paper wins Best Paper at NeurIPS, where the world’s top AI researchers gather every year, it’s not optional reading—it’s a must-study paper. These are the works that set the research agenda, and skipping them is how you quietly fall behind.

Here’s the outline of the seminar:

  1. Introduction

  2. Preliminary: Multi-Head Softmax Attention

    • QKV Linear Projection

    • Scaled Dot-Product Attention (SDPA)

    • Multi-Head Concatenation

    • Final Output Layer

  3. Augmenting the Attention Layer with Gating Mechanisms (see the code sketch after this outline)

    • Gating Mechanism

    • Positions

    • Query Gating

    • Key Gating

    • Value Gating

    • SDPA Gating

    • Output Gating

    • Additive Gating

  4. Experiments

    • Main Results

    • Perplexity (PPL)

      • Single Token

      • Token Sequence

  5. Analysis

    • Initial Tokens Are Attention Sinks

    • SDPA Output Gating Reduces Attention Sink via Sparsity
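
To make the gating idea in section 3 concrete, below is a minimal PyTorch sketch of one of the positions listed above: SDPA output gating, where a sigmoid gate computed from the layer input is applied elementwise to the attention output before the final projection. The module and its names (GatedSDPAOutput, the gate projection) are my own illustration of this gating position, not the paper's reference code, and the causal mask is omitted for brevity.

```python
# A hedged sketch of SDPA output gating: sigmoid gates, computed from the
# layer input, scale the multi-head attention output before the final
# projection. Illustrative only; not the paper's reference implementation.
import torch
import torch.nn as nn

class GatedSDPAOutput(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.gate = nn.Linear(d_model, d_model)     # gate scores from the layer input
        self.out = nn.Linear(d_model, d_model)      # final output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, n_heads, T, d_head).
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, D)  # concatenate heads
        g = torch.sigmoid(self.gate(x))  # gates in (0, 1), one per element
        return self.out(g * y)           # gate the SDPA output, then project

x = torch.randn(2, 5, 64)               # (batch, tokens, d_model)
print(GatedSDPAOutput(64, 8)(x).shape)  # torch.Size([2, 5, 64])
```

Because the sigmoid can push individual channels toward zero, an output gate of this shape gives the layer a direct way to sparsify what each head emits, which is the lever the analysis section connects to the attention-sink behavior.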

Recordings

Scaled Dot-Product Attention

I used the opportunity to review foundational topics, such as scaled dot-product attention. In this video clip, I break down scaled dot-product attention from first principles, showing how queries (Q) are compared against keys (K) to produce attention scores. We then walk step by step through scaling, softmax, and value weighting, paying careful attention to matrix shapes, transposes, and where each number actually comes from.
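
If you want to run the same computation yourself, here is a minimal single-head sketch in NumPy. The shapes, variable names, and the tiny random example are my own choices for illustration, not code from the seminar.

```python
# A minimal sketch of (single-head) scaled dot-product attention in NumPy.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(Q, K, V):
    # Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # compare each query against every key
    weights = softmax(scores, axis=-1)  # each row of weights sums to 1
    return weights @ V                  # weighted average of the value rows

# Tiny worked example: 2 queries, 3 keys/values, d = 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
print(sdpa(Q, K, V).shape)  # (2, 4): one d-dimensional output per query
```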

Perplexity (PPL)

Since Gated Attention evaluates performance using Perplexity (PPL), I finally got to demystify this perplexing metric, pun intended. In this video clip, I frame perplexity as a measure of how many choices a model believes it has at each step. We walk through how softmax turns scores into probabilities, why a perfect perplexity of 1 means zero uncertainty, and how different choice distributions directly affect PPL and model performance.
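
As a companion to the clip, here is a small sketch of how perplexity falls out of per-token probabilities. The function name and the example numbers are mine, chosen to mirror the "number of choices" framing.

```python
# Perplexity as the exponential of the average negative log-likelihood.
import numpy as np

def perplexity(token_probs):
    # token_probs: the probability the model assigned to each correct token.
    return float(np.exp(-np.mean(np.log(token_probs))))

print(perplexity([1.0, 1.0, 1.0]))     # 1.0: the model was certain at every step
print(perplexity([0.25, 0.25, 0.25]))  # 4.0: like hesitating among 4 equal choices
```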

Gated Attention (Full Recording)

⬇️ Download the workbook and watch the full recording of the seminar.
