Gated Attention ~ NeurIPS 2025 Best Paper
Frontier AI Seminar Series
Earlier this week, I delivered a Frontier AI Seminar diving deep into Gated Attention, winner of the NeurIPS 2025 Best Paper Award.
When a paper wins Best Paper at NeurIPS, where the world’s top AI researchers gather every year, it’s not optional reading—it’s a must-study paper. These are the works that set the research agenda, and skipping them is how you quietly fall behind.
Here’s the outline of the seminar:
Introduction
Preliminary: Multi-Head Softmax Attention
    QKV Linear Projection
    Scaled Dot-Product Attention (SDPA)
    Multi-Head Concatenation
    Final Output Layer
Augmenting Attention Layer with Gating Mechanisms
    Gating Mechanism
    Positions
        Query Gating
        Key Gating
        Value Gating
        SDPA Gating
        Output Gating
    Additive Gating
Experiments
    Main Results
    Perplexity (PPL)
        Single Token
        Token Sequence
    Analysis
        Initial Tokens Are Attention Sink
        SDPA Output Gating Reduces Attention Sink via Sparsity
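As a quick illustration of the "SDPA Gating" and "Output Gating" positions listed above, here is a minimal sketch of the core idea: an input-dependent sigmoid gate applied elementwise to a head's SDPA output before the final output projection. The function and weight names are my own placeholders; see the paper and the full recording for the exact formulation the authors use.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_sdpa_output(X, sdpa_out, W_gate):
    # X:        (seq_len, d_model) layer input used to compute the gate
    # sdpa_out: (seq_len, d_head)  SDPA output of a single attention head
    # W_gate:   (d_model, d_head)  gate projection (placeholder name and shape)
    gate = sigmoid(X @ W_gate)     # values in (0, 1), per position and per channel
    return gate * sdpa_out         # multiplicative, elementwise gating of the SDPA output
```

Because the gate is input-dependent and can push many channels toward zero, it adds non-linearity and sparsity to the attention output, which is the property the analysis section ties to reduced attention sinks.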
Recordings
Scaled Dot-Product Attention
I used the opportunity to review foundational topics such as scaled dot-product attention. In this video clip, I break down scaled dot-product attention from first principles, showing how queries (Q) are compared against keys (K) to produce attention scores. We then walk step by step through scaling, softmax, and value weighting, with careful attention to matrix shapes, transposes, and where each number actually comes from.
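To make the shapes concrete, here is a minimal NumPy sketch of scaled dot-product attention. The tensor names, dimensions, and toy random inputs are my own illustration, not taken from the seminar workbook.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    # Compare every query against every key: (n_queries, n_keys) score matrix.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value vectors: (n_queries, d_v).
    return weights @ V, weights

# Tiny example: 3 queries, 4 keys/values, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (3, 8) (3, 4)
```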
Perplexity (PPL)
Since the Gated Attention paper evaluates performance using perplexity (PPL), I finally got to demystify this perplexing metric, pun intended. In this video clip, I frame perplexity as a measure of how many choices a model believes it has at each step. We walk through how softmax turns scores into probabilities, why a perfect perplexity of 1 means zero uncertainty, and how different choice distributions directly affect PPL and model performance.
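As a back-of-the-envelope illustration (a toy example of my own, not taken from the seminar), perplexity is the exponential of the average negative log-probability the model assigns to the correct tokens, so it reads as the model's effective number of choices per step.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def perplexity(per_token_probs):
    # PPL = exp(mean negative log-probability assigned to the correct tokens).
    return float(np.exp(-np.mean(np.log(per_token_probs))))

# Always puts probability 1.0 on the correct token: PPL = 1 (zero uncertainty).
print(perplexity([1.0, 1.0, 1.0]))      # 1.0

# Spreads probability uniformly over 10 choices: PPL = 10.
print(perplexity([0.1, 0.1, 0.1]))      # ~10.0

# Turning raw scores (logits) into probabilities with softmax, then scoring the true token.
logits = np.array([2.0, 1.0, 0.2])      # scores for a toy 3-token vocabulary
probs = softmax(logits)
true_token = 0                          # suppose token 0 is the correct next token
print(perplexity([probs[true_token]]))  # below 3, since the model prefers token 0
```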
Gated Attention (Full Recording)
⬇️ Download the workbook and watch the full recording of the seminar.