LoRA
Fine-Tuning series · 8 of 8
Adapters showed that a low-rank bottleneck can specialize a frozen layer with far fewer parameters than a full ΔW.
LoRA (Hu et al., 2021) takes the same low-rank idea and makes two changes:
1. Drop the nonlinearity.
"Down-project + ReLU + up-project" collapses into a single low-rank matrix product: ΔW = BA.
2. Add to the weight matrix, not between layers.
The effective weight is W' = W + ΔW = W + BA.
At inference, ΔW disappears: it is folded directly into W.
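The two changes can be checked numerically. A minimal NumPy sketch (dimensions and initializations are illustrative, not from the paper): the training-time forward pass keeps the frozen W and the low-rank correction as separate terms, yet produces exactly the same output as the merged weight W + BA.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2                 # rank r is much smaller than d_out, d_in

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
B = np.zeros((d_out, r))                 # trainable; zero init so ΔW starts at 0
A = rng.normal(size=(r, d_in)) * 0.01    # trainable

x = rng.normal(size=(d_in,))

# Training-time forward: frozen path plus low-rank correction BAx
h = W @ x + B @ (A @ x)

# Identical to applying the effective weight W' = W + BA once
h_merged = (W + B @ A) @ x
assert np.allclose(h, h_merged)
```

Zero-initializing B is a common choice because it makes ΔW = BA start at exactly zero, so fine-tuning begins from the pretrained model's behavior.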
Put side by side: an adapter is a new module inserted into the forward path, while LoRA expresses the same low-rank idea as a correction to the weight matrix itself.
The benefit of dropping the nonlinearity is that B and A can be merged into a single low-rank ΔW matrix at inference time. During training, B and A live alongside W as extra matrices.
During training, the forward pass keeps W and the BA correction as two separate terms, so gradients can flow into both B and A while W stays frozen.
At inference you merge them once and for all: compute the combined weight, ship it, and discard the individual matrices. The resulting model has the exact same architecture and inference cost as the pretrained one. Same shape, same number of matrix multiplications, no extra latency, no extra memory.
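A sketch of that one-time merge (NumPy, illustrative shapes): the shipped weight has exactly the same shape as the pretrained one, so the serving stack needs no changes, and B and A can be discarded afterward.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 2
W = rng.normal(size=(d, d))      # pretrained weight
B = rng.normal(size=(d, r))      # fine-tuned low-rank factors
A = rng.normal(size=(r, d))

# One-time merge before shipping
W_shipped = W + B @ A
assert W_shipped.shape == W.shape          # same architecture, same cost

# The merged model matches the two-path training forward exactly
x = rng.normal(size=(d,))
assert np.allclose(W_shipped @ x, W @ x + B @ (A @ x))
```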
An adapter can't do this. Its down-project–ReLU–up-project module stays in the forward path forever, and every inference call pays for those extra operations.
This is why LoRA is the default for large models today. You can keep thousands of tiny (B, A) pairs — one per fine-tuning task, a few megabytes each. Swap them in and out of a shared frozen backbone without changing its runtime shape.
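Because the merge is just an addition, it is also exactly reversible: subtracting BA recovers the shared backbone, which is what makes per-task swapping cheap. A hypothetical sketch (task names and the merge/unmerge helpers are illustrative, not a real API):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 8, 2
W = rng.normal(size=(d, d))      # shared frozen backbone weight

# Hypothetical per-task (B, A) pairs: 2*d*r numbers each vs d*d for a full W
adapters = {
    "summarize": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "translate": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def merge(weight, task):
    B, A = adapters[task]
    return weight + B @ A        # runtime shape is unchanged

def unmerge(weight, task):
    B, A = adapters[task]
    return weight - B @ A        # recover the shared backbone exactly

W_eff = merge(W, "summarize")    # swap task in
W_back = unmerge(W_eff, "summarize")
assert np.allclose(W_back, W)    # backbone intact after swap-out
W_eff = merge(W_back, "translate")   # swap a different task in
```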


