AI by Hand ✍️

Feature Extraction + Head

Fine-Tuning series · 6 of 8

Prof. Tom Yeh
Apr 24, 2026
∙ Paid


A feature head is a small trainable MLP bolted onto a frozen pretrained backbone. Think of it as pursuing a PhD on top of a master's degree. The master's — your pretrained backbone — stays exactly as it was, with no review. You aren't re-taking Linear Algebra or Probability; you're building something specialized on top of it: the PhD adds its own coursework, its own nonlinearity, and its own thesis layer.

This is one step richer than a linear probe, which bolts on a single linear projection — like earning one certificate after the master's. Certificates are quick and cheap, but they can only form linear combinations of subjects you already know. A feature head, with multiple trainable layers, can form nonlinear connections and capture task-specific structure the probe can't.
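The difference is easy to see in code. Here's a minimal PyTorch sketch, using the dimensions from this post's example (64 frozen features in, 10 classes out) — the shapes are illustrative, not prescriptive:

```python
import torch.nn as nn

feat_dim, n_classes = 64, 10  # dimensions borrowed from this post's example

# Linear probe: one linear projection on frozen features. Like a
# certificate, it can only form linear combinations of what the
# backbone already knows.
linear_probe = nn.Linear(feat_dim, n_classes)

# Feature head: a small MLP. The hidden layer and the ReLU between
# W4 and W5 let it capture nonlinear, task-specific structure.
feature_head = nn.Sequential(
    nn.Linear(feat_dim, 20),   # W4: 20 × 64
    nn.ReLU(),
    nn.Linear(20, n_classes),  # W5: 10 × 20
)
```

Both sit on the same frozen features; the only difference is the hidden layer and the nonlinearity in between.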

And it's a tighter budget than Freezing Layers, where we still refreshed a top layer's weights directly. Here, every weight in the backbone is permanently frozen — no ΔW anywhere. All trainable parameters live in the head.

In the diagram, the backbone (gray, dashed) extracts features without changing. The head (red border) is the trainable MLP: a nonlinear mapping from those frozen features to task predictions. This is the standard recipe in computer vision — take a pretrained ResNet or ViT, freeze it, and train a task-specific head on top.
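As a concrete sketch of the recipe, here is a toy version in PyTorch with the layer sizes from the diagram (W1: 64×32, W2 and W3: 64×64 in the frozen backbone; W4: 20×64 and W5: 10×20 in the head). A real pipeline would swap the toy backbone for a pretrained ResNet or ViT; everything else stays the same:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone, with this post's layer sizes:
# W1 (64×32), W2 and W3 (64×64).
backbone = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)

# The trainable head: W4 (20×64) and W5 (10×20).
head = nn.Sequential(
    nn.Linear(64, 20), nn.ReLU(),
    nn.Linear(20, 10),
)

# Freeze every backbone weight -- no ΔW anywhere in the pretrained model.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()  # also fixes dropout/batch-norm behavior, if present

# Only the head's parameters ever reach the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(8, 32)      # a batch of 8 inputs
with torch.no_grad():       # feature extraction: no gradients, no updates
    features = backbone(x)
logits = head(features)     # gradients flow only through the head
```

Because the backbone never changes, its features can even be precomputed once and cached — and the same frozen model can serve many different heads.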

How much did we save?

Full fine-tuning would update every weight in both the backbone and the head — W1 through W5:

64 × 32 (W1) + 2 × 64 × 64 (W2, W3) + 20 × 64 (W4) + 10 × 20 (W5) = 11720

parameters.

Freezing the backbone leaves only the head trainable — W4 and W5:

20 × 64 (W4) + 10 × 20 (W5) = 1480

parameters. That's about 7.92× fewer weights to train, and — because the backbone is shared — the same frozen model can support dozens of downstream tasks, each with its own tiny PhD head.
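The arithmetic above is easy to check in a few lines of plain Python (weight matrices only, biases ignored, as in the hand calculation):

```python
# Weight shapes as (out, in), matching the post's hand calculation.
backbone_shapes = [(64, 32), (64, 64), (64, 64)]  # W1, W2, W3 (frozen)
head_shapes = [(20, 64), (10, 20)]                # W4, W5 (trainable)

def count(shapes):
    """Total number of weights across a list of (out, in) matrices."""
    return sum(out * inp for out, inp in shapes)

full_finetune = count(backbone_shapes) + count(head_shapes)
head_only = count(head_shapes)

print(full_finetune)                        # 11720
print(head_only)                            # 1480
print(round(full_finetune / head_only, 2))  # 7.92
```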

The next lesson takes the idea in a different direction: instead of bolting a head onto the end, we'll sprinkle small trainable modules throughout the network — adapter layers.


← Previous: Linear Probe | Adapter Layers →
