AI by Hand ✍️

DeepSeek Attention (DSA) - Excel Blueprint

Frontier Model Math by hand ✍️

Prof. Tom Yeh
Oct 21, 2025

(P.S. This issue is written for advanced AI engineers and researchers. It is part of the premium Frontier subscription. If you are a beginner, please check out my free Foundation series, lectures, walkthroughs, and Excel exercises.)

In this Frontier issue, I’m doing something special. Instead of sharing worksheets, I’m releasing the Excel Blueprint of the DeepSeek Attention layer, which I originally created while consulting for a company.

Why do I call it a blueprint? Because its purpose is to help an AI/ML team trace every algorithmic step, verify the math, and translate it into their own codebase using their preferred programming language.

While I can’t share the company’s implementation code, I can share my blueprint. The beauty of this blueprint is that it’s framework-agnostic — once you follow the operations, you can re-implement them in PyTorch, JAX, C++, or any language of your choice.

👇 Scroll to the bottom to download my Excel Blueprint for DeepSeek Attention.

What is DeepSeek Attention?

A few weeks ago, DeepSeek released version 3.2, introducing a new mechanism called DeepSeek Sparse Attention (DSA), shortened to DeepSeek Attention in this issue. It adds a component known as the Lightning Indexer, a key innovation that improves attention efficiency and scalability.

Why?

DeepSeek’s core goal was simple: improve efficiency. That means two things—save time and save memory.

But time and memory often work against each other. For example, we cache keys and values (KV cache) to avoid recomputing them repeatedly. That saves time, but it costs memory. DeepSeek attacked this tradeoff on both sides: how to use less memory without increasing computation time, and how to reduce computation time without introducing accuracy loss.

Let’s walk through how.

KV Cache: Trading Space for Time

In the attention mechanism, every new query must be compared against all past keys, and the resulting weights are then applied to all past values. Without caching, we would need to recompute all past keys and values at every step. That’s too slow.

So we cache them.
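
To make the caching idea concrete, here is a minimal single-head sketch in NumPy. It is not taken from the blueprint: the `KVCache` class, the `decode_step` function, and the weight names `W_q`, `W_k`, `W_v` are all assumptions made for illustration. At each step only the newest token is projected, and its key and value are appended to the cache.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Hypothetical append-only store of past keys and values for one head."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k, v):
        # Add one row per new token; earlier rows are never recomputed.
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

def decode_step(x_t, W_q, W_k, W_v, cache):
    """Attend from the newest token to every cached token, itself included."""
    q = x_t @ W_q                          # query for the new token only
    cache.append(x_t @ W_k, x_t @ W_v)     # project and store the new key/value
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    weights = softmax(scores)              # one weight per cached token
    return weights @ cache.values          # weighted sum of cached values
```

Calling `decode_step` once per token should give the same output for the latest token as recomputing every key and value from scratch. The difference is that each step does one key projection and one value projection instead of one per past token.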

But as the context window grows, say from 100K tokens to 200K, the KV cache doubles along with it. The growth is only linear, but at million-token scale the memory cost still becomes unsustainable.
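
A back-of-the-envelope calculation shows how this scales. The model dimensions below (32 layers, 32 KV heads, head size 128, 2 bytes per element) are made-up illustrative numbers, not DeepSeek's actual configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """Rough KV cache size: one key and one value vector per token, head, and layer."""
    per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem  # 2 = key + value
    return n_tokens * per_token

# Assumed configuration, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=32, d_head=128, bytes_per_elem=2)

for n_tokens in (100_000, 200_000, 1_000_000):
    gib = kv_cache_bytes(n_tokens, **cfg) / 2**30
    print(f"{n_tokens:>9,} tokens -> {gib:,.1f} GiB")
```

Under these assumptions each token costs 524,288 bytes (0.5 MiB) of cache, so the total is proportional to context length: doubling the tokens doubles the memory.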

So the question becomes: how do we shrink the KV cache without losing its time-saving benefit?

Reduce KV Cache Copies with Multi-Query Attention

In standard Multi-Head Attention, each head computes and stores its own set of keys and values. If there are 3 heads as in this example, we store 3 separate KV caches.

DeepSeek Attention switches to Multi-Query Attention (MQA). Here, all heads share the same keys and values, but queries remain head-specific.

That immediately reduces KV storage from 3 copies to just 1.
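
The storage difference can be sketched in NumPy. This is a simplified illustration of generic MHA versus MQA, not the blueprint's layout or DeepSeek's implementation. The function names, weight shapes, and dimensions used here are all assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mha_kv(X, W_k, W_v):
    """Multi-Head Attention: each head has its own key/value projection.
    X: (T, d_model); W_k, W_v: (n_heads, d_model, d_head)."""
    K = np.einsum("td,hde->hte", X, W_k)   # (n_heads, T, d_head)
    V = np.einsum("td,hde->hte", X, W_v)   # (n_heads, T, d_head)
    return K, V

def mqa_kv(X, W_k, W_v):
    """Multi-Query Attention: a single key/value projection shared by all heads.
    X: (T, d_model); W_k, W_v: (d_model, d_head)."""
    return X @ W_k, X @ W_v                # each (T, d_head)

def mqa_attention(x_t, W_q, K, V):
    """Head-specific queries attend over the one shared K and V.
    x_t: (d_model,); W_q: (n_heads, d_model, d_head)."""
    Q = np.einsum("d,hde->he", x_t, W_q)         # (n_heads, d_head)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (n_heads, T)
    return softmax(scores) @ V                   # (n_heads, d_head)
```

With 3 heads, `mha_kv` returns arrays with a leading dimension of 3, while `mqa_kv` returns a single `(T, d_head)` array for keys and one for values. That is the 3-to-1 reduction in cached entries, and `mqa_attention` still produces a separate output per head because the queries differ.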
