Feature Extraction + Head
Fine-Tuning series · 6 of 8
A feature head is a small trainable MLP bolted onto a frozen pretrained backbone. Think of it as pursuing a PhD on top of a master's degree. The master's — your pretrained backbone — stays exactly as it was; you aren't re-taking Linear Algebra or Probability. Instead, you build something specialized on top of it: the PhD adds its own coursework, its own nonlinearity, and its own thesis layer.
This is one step richer than a linear probe, which bolts on a single linear projection — like earning one certificate after the master's. Certificates are quick and cheap, but they can only form linear combinations of subjects you already know. A feature head, with multiple trainable layers, can form nonlinear connections and capture task-specific structure the probe can't.
And it's a tighter budget than Freezing Layers, where we still updated a top layer's weights directly. Here, every weight in the backbone is permanently frozen — no ΔW anywhere. All trainable parameters live in the head.
In the diagram, the backbone (gray, dashed) extracts features without changing. The head (red border) is the trainable MLP: a nonlinear mapping from those frozen features to task predictions. This is the standard recipe in computer vision — take a pretrained ResNet or ViT, freeze it, and train a task-specific head on top.
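A minimal sketch of that recipe, in NumPy, using the layer sizes from this lesson's parameter count (W1: 64×32, W2 and W3: 64×64, head W4: 20×64 and W5: 10×20). All variable names here are illustrative, and the backbone stand-in is a toy stack of random matrices, not a real pretrained model:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Frozen backbone: W1 (64x32), W2 and W3 (64x64 each). Never updated.
W1 = rng.standard_normal((64, 32)) * 0.1
W2 = rng.standard_normal((64, 64)) * 0.1
W3 = rng.standard_normal((64, 64)) * 0.1

# Trainable head: W4 (20x64), W5 (10x20). The only weights a training
# step would ever touch.
W4 = rng.standard_normal((20, 64)) * 0.1
W5 = rng.standard_normal((10, 20)) * 0.1

def extract_features(x):
    # Backbone forward pass: pure feature extraction, no gradients needed.
    return relu(W3 @ relu(W2 @ relu(W1 @ x)))

def head(f):
    # Nonlinear head: hidden layer plus output layer.
    return W5 @ relu(W4 @ f)

x = rng.standard_normal(32)
features = extract_features(x)   # frozen features, shape (64,)
logits = head(features)          # task predictions, shape (10,)
```

In a framework like PyTorch the same split is expressed by setting `requires_grad=False` on the backbone's parameters (or wrapping its forward pass in `torch.no_grad()`) and handing only the head's parameters to the optimizer.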
How much did we save?
Full fine-tuning would update every weight in both the backbone and the head — W1 through W5, where W1 is 64 × 32, W2 and W3 are each 64 × 64, W4 is 20 × 64, and W5 is 10 × 20:
64 × 32 + 2 × 64 × 64 + 20 × 64 + 10 × 20 = 11720
parameters.
Freezing the backbone leaves only the head trainable — W4 and W5:
20 × 64 + 10 × 20 = 1480
parameters. That's about 7.92× fewer weights to train, and — because the backbone is shared — the same frozen model can support dozens of downstream tasks, each with its own tiny PhD head.
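The arithmetic above checks out in a few lines of Python:

```python
# Per-matrix parameter counts for this lesson's toy network.
backbone = 64 * 32 + 2 * 64 * 64   # W1, W2, W3 (frozen)
head = 20 * 64 + 10 * 20           # W4, W5 (trainable)

full = backbone + head             # full fine-tuning: every weight
print(full)                        # 11720
print(head)                        # 1480
print(round(full / head, 2))       # 7.92x fewer trainable weights
```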
The next lesson takes the idea in a different direction: instead of bolting a head onto the end, we'll sprinkle small trainable modules throughout the network — the adapter layers.
← Previous: Linear Probe | Next: Adapter Layers →