AI by Hand ✍️

Inference Batching, Request-vs-Token Level

Frontier AI Drawings: 9 of 13

Prof. Tom Yeh
Sep 16, 2025

Frontier AI Drawings: the series

  1. "Expert Choice" Mixture of Experts (MoE)

  2. MHA, MQA, GQA, MoE-A: More Attention!

  3. New GPT-OSS Trick to Ignore Tokens

  4. MXFP4, FP4, FP8

  5. LoRA, Fine-Tune, Pre-Train

  6. QLoRA, DoRA, BitFit, NF4 vs INT4

  7. KV Cache, Prefill, Decode

  8. EmbeddingGemma, MRL, InfoNCE, Embed vs. Decode

  9. Inference Batching, Request-vs-Token Level

  10. MLP Parallelism: Data, Context, Row, Column, Pipeline

  11. RoPE vs PE in QKV Self-Attention

  12. RMS, Group, Layer, Batch Norm, Tensor Parallelism

  13. Qwen 3

A Deep Dive on AI Inference Startups - by Kevin Zhang

Lately I have been helping an inference startup check whether their batch inference implementation really matches the math in the paper. In conversations with their engineers, I learned that one of the most confusing questions is why some operations can be “batched” and run in parallel across many requests, while others must be executed in strict sequence.

Can QKV projections be parallelized? What about the scaled dot product attention, or the feedforward layer? These are not minor implementation details — they determine whether the system is both correct and efficient.
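
To make the question concrete, here is a minimal sketch in plain Python, not the startup's implementation and not the content of the drawings. It assumes a single attention head, random weights, and two hypothetical requests, A and B, whose caches already hold 3 and 5 tokens. All names (`matmul`, `attend`, `w_q`, `cache`) are made up for illustration. The point it illustrates is a general property of the transformer: a linear projection acts on each token row independently, so rows from different requests can be stacked into one matrix multiply, while attention for a token must read only that request's own keys and values.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols):
    """Random matrix, standing in for real weights or activations."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over ONE request's cache."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

random.seed(0)
d = 4                                  # hidden size (assumption)
w_q = rand_matrix(d, d)                # hypothetical query projection

# Hypothetical requests at different positions: A has 3 cached tokens, B has 5.
# Simplification: each cache is treated as already containing the current token.
cache = {"A": (rand_matrix(3, d), rand_matrix(3, d)),
         "B": (rand_matrix(5, d), rand_matrix(5, d))}
x = {"A": rand_matrix(1, d)[0], "B": rand_matrix(1, d)[0]}  # current hidden states

# Projection: stack one row per request and do a single matmul.
q_batched = matmul([x["A"], x["B"]], w_q)
q_separate = [matmul([x["A"]], w_q)[0], matmul([x["B"]], w_q)[0]]

# Attention: each query row is paired with its own request's keys and values.
out = {name: attend(q, *cache[name]) for name, q in zip(("A", "B"), q_batched)}
```

Here `q_batched` and `q_separate` hold the same numbers, which is why the projection can be shared across requests. The `attend` calls cannot be merged the same way without some form of masking or per-request indexing, because the two caches have different lengths and must not mix.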

Inference has become a billion-dollar battleground, with companies like Together, Anyscale, and Fireworks attracting huge VC bets. Inference is hard to get right: very few engineers truly understand it, and those who do command some of the highest-paying jobs in the industry.

Drawings

Two weeks ago we looked at the KV cache and the contrast between prefill and decode for one request. That laid the foundation for extending the story to multiple requests and how they can be “batched” to run efficiently.

I created four new drawings:

  1. Single Request (no batching)

  2. Request-level Batching

  3. Token-level Batching (a.k.a. continuous batching)

  4. Decoding at different positions, the defining property of continuous batching
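
The scheduling difference between items 2 and 3 can be sketched with a toy simulation. This is an illustration under stated assumptions, not the drawings themselves: one step decodes one token for every active request, prefill is ignored, every request needs at least one token, and the output lengths are invented. The function names `request_level_steps` and `token_level_steps` are hypothetical.

```python
from collections import deque

def request_level_steps(lengths, batch_size):
    """Request-level batching: a batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def token_level_steps(lengths, batch_size):
    """Token-level batching: a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []                        # tokens still to decode per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                      # one decode step for every active request
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [2, 8, 3, 3]                  # made-up number of tokens each request decodes
print(request_level_steps(lengths, batch_size=2))  # 11
print(token_level_steps(lengths, batch_size=2))    # 8
```

With request-level batching, the 2-token request sits idle in its slot while its 8-token partner finishes, and the second batch cannot start until then. With token-level batching, the later requests join as soon as a slot frees up, so the requests in a batch end up at different decode positions, which is exactly the situation item 4 describes.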

Work through these step by step, and you will see the answer to the question we started with: which parts of the inference pipeline can be combined and run in parallel across requests, and which must be executed in sequence. That clarity is the key to understanding why batching is the true secret sauce of inference startups!

