AI by Hand ✍️

Inference Batching, Request-vs-Token Level

Frontier Model Math by hand ✍️

Prof. Tom Yeh
Sep 16, 2025 ∙ Paid

(P.S. This issue is written for advanced AI engineers and researchers. It is part of the premium Frontier subscription. If you are a beginner, please check out my free lectures, walkthroughs, and Excel exercises. I also share regular announcements of free learning opportunities.)

A Deep Dive on AI Inference Startups - by Kevin Zhang

Lately I have been helping an inference startup check whether their batch inference implementation really matches the math in the paper. In conversations with their engineers, I learned that one of the most confusing questions is why some operations can be “batched” and run in parallel across many requests, while others must be executed in strict sequence.

Can QKV projections be parallelized? What about the scaled dot product attention, or the feedforward layer? These are not minor implementation details — they determine whether the system is both correct and efficient.
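To make the distinction concrete before we get to the worksheets, here is a minimal NumPy sketch of one decode step for three concurrent requests. This is my own illustration, not the worksheet math; the toy sizes, weight names, and cache lengths are assumptions. The point it shows: the QKV projection and the feedforward layer are position-wise matmuls that batch into single GEMMs across requests, while scaled dot-product attention reads each request's own KV cache (of a different length) and so runs per request.

```python
# Minimal sketch (assumed toy sizes, not the worksheet values):
# which decode-step operations batch across requests, and which do not.
import numpy as np

d_model, n_req = 8, 3                                  # hypothetical model width, 3 concurrent requests
rng = np.random.default_rng(0)
W_qkv = rng.standard_normal((d_model, 3 * d_model))    # fused QKV projection weights
W_ffn = rng.standard_normal((d_model, d_model))        # stand-in for the feedforward layer

# One new token per request at this decode step -> stack into a (n_req, d_model) batch.
x = rng.standard_normal((n_req, d_model))

# Batched: QKV projection and the FFN are position-wise, so one GEMM serves all requests.
qkv = x @ W_qkv                                        # (n_req, 3*d_model)
q, k, v = np.split(qkv, 3, axis=-1)
ffn_out = x @ W_ffn                                    # also a single batched matmul

# Per-request: attention reads each request's own KV cache, and the caches have
# different lengths, so this part cannot collapse into one GEMM.
kv_cache = [(rng.standard_normal((t, d_model)), rng.standard_normal((t, d_model)))
            for t in (5, 12, 2)]                       # three requests at different positions
for i, (K, V) in enumerate(kv_cache):
    K = np.concatenate([K, k[i:i + 1]])                # append this step's key
    V = np.concatenate([V, v[i:i + 1]])                # append this step's value
    scores = (q[i] @ K.T) / np.sqrt(d_model)           # scaled dot-product attention
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out_i = weights @ V                                # attention output for request i
```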

Inference has become a billion-dollar battleground, with companies like Together, Anyscale, and Fireworks attracting huge VC bets. Serving inference efficiently is hard. Very few engineers truly understand it, and those who do command some of the highest-paying jobs in the industry.

Worksheets

Two weeks ago we looked at the KV cache and the contrast between prefill and decode for one request. That was laying the foundation so that I could extend the story to multiple requests and how they can be “batched” to run efficiently.
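As a quick refresher, here is a rough sketch of that single-request picture under my own simplified assumptions (the prompt length, model width, and weight names are illustrative, not the worksheet numbers): prefill builds the KV cache for the whole prompt in one pass, while decode appends one key/value pair per generated token and reuses the cache, which is why decode is strictly sequential.

```python
# Rough sketch (assumed toy sizes): prefill vs decode for a single request.
import numpy as np

d = 8
rng = np.random.default_rng(1)
W_k, W_v = rng.standard_normal((d, d)), rng.standard_normal((d, d))

prompt = rng.standard_normal((6, d))                 # hidden states for 6 prompt tokens
K_cache, V_cache = prompt @ W_k, prompt @ W_v        # prefill: one big matmul per projection

for _ in range(3):                                   # decode: one token per step, in sequence
    x_new = rng.standard_normal((1, d))              # the latest token's hidden state
    K_cache = np.concatenate([K_cache, x_new @ W_k]) # append to the cache
    V_cache = np.concatenate([V_cache, x_new @ W_v])
    # attention for the new token now reads the full cache (omitted here)
```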

For this week’s Frontier issue, I created four new worksheets:

  1. Single Request (no batching)

  2. Request-level Batching

  3. Token-level Batching (a.k.a., continuous batching)

  4. Decoding at different positions, the defining property of continuous batching

Work through these step by step, and you will see the answer to the question we started with: which parts of the inference pipeline can be combined and run in parallel across requests, and which must be executed in sequence. That clarity is the key to understanding why batching is the true secret sauce of inference startups!
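If you want a feel for why the token-level schedule wins before working through the sheets, here is a toy scheduler comparison. The request lengths, batch size, and step counting below are illustrative assumptions, not the worksheet values: request-level batching holds every slot until the slowest request in the batch finishes, while continuous batching refills slots after every decode step, so each running sequence sits at a different position.

```python
# Toy comparison (assumed request lengths and batch size, not the worksheet values):
# request-level (static) batching vs token-level (continuous) batching.
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    name: str
    remaining: int                                   # tokens left to generate

def request_level(requests, batch_size):
    """Static batching: a batch occupies the GPU until its slowest member finishes."""
    queue, steps = deque(requests), 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(r.remaining for r in batch)     # short requests wait for the longest one
    return steps

def token_level(requests, batch_size):
    """Continuous batching: after every decode step, finished requests leave
    and waiting requests join, so sequences sit at different positions."""
    queue, running, steps = deque(requests), [], 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())          # admit new requests mid-flight
        steps += 1                                   # one decode step for everyone running
        for r in running:
            r.remaining -= 1
        running = [r for r in running if r.remaining > 0]
    return steps

def make_requests():
    return [Request("a", 2), Request("b", 10), Request("c", 3), Request("d", 1)]

print(request_level(make_requests(), batch_size=2))  # 13 steps: slots sit idle
print(token_level(make_requests(), batch_size=2))    # 10 steps: slots refilled every step
```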

⬇️ Download the worksheets below (for Frontier Subscribers only)
