Batch Normalization by Hand ✍️
Batch normalization is a common technique for improving training and achieving faster convergence. It sounds simple, but it is often misunderstood.
🤔 Does batch normalization involve trainable parameters? tunable hyper-parameters? or both?
🤔 Is batch normalization applied to inputs, features, weights, biases, or outputs?
🤔 How is batch normalization different from layer normalization?
This exercise can help shed some light on these questions.
1. Start with the Input
We begin with a mini-batch:
4 training examples
Each example has 3 features (see the sketch below)
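Here is a minimal NumPy sketch of such a mini-batch, laid out the way the worked exercise draws it: one row per feature and one column per example. The numbers are made up for illustration, and the snippets in the later steps build on this one.

```python
import numpy as np

# Mini-batch: 3 feature rows x 4 example columns (values are made up).
X = np.array([
    [ 1., -2.,  0.,  3.],
    [ 2.,  1., -1.,  0.],
    [-1.,  4.,  2.,  1.],
])
print(X.shape)  # (3, 4)
```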
2. Apply a Linear Layer
The first transformation is a standard linear layer:
Multiply the input by weights
Add biases
This produces a new set of feature values for each example (see the sketch below)
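Continuing the sketch, here is one way this step could look. The 3x3 weight matrix W and the 3x1 bias b are hypothetical placeholders; in a real network they would be learned.

```python
# Hypothetical weights (3x3) and biases (3x1), not taken from the exercise.
W = np.array([
    [ 1.,  0., -1.],
    [ 0.,  2.,  1.],
    [-1.,  1.,  0.],
])
b = np.array([[ 1.],
              [ 0.],
              [-1.]])

# Linear layer: multiply the input by the weights, then add the biases.
Z = W @ X + b   # shape (3, 4): one column of new feature values per example
```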
3. Activate with ReLU
Next, we pass these features through a ReLU activation function:
Negative values are set to 0 (e.g., -2 → 0)
This introduces non-linearity and suppresses negative activations (see the sketch below)
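In the running sketch, ReLU is a single element-wise maximum:

```python
# ReLU: negative values become 0 (e.g., -2 -> 0); positive values pass through.
A = np.maximum(Z, 0.0)
```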
4. Calculate Batch Statistics
Before normalizing, we calculate statistics across the mini-batch:
Sum, mean, variance, and standard deviation
These are computed separately for each feature dimension, i.e., across the 4 examples in the batch (row-wise in the diagram); see the sketch below
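Continuing the sketch, the batch axis here is axis 1 (the 4 example columns), so each statistic comes out as one value per feature row:

```python
# Per-feature statistics, computed across the 4 examples in each row.
total = A.sum(axis=1, keepdims=True)    # sum,                 shape (3, 1)
mean  = A.mean(axis=1, keepdims=True)   # mean,                shape (3, 1)
var   = A.var(axis=1, keepdims=True)    # biased variance, as BN uses in training
std   = np.sqrt(var)                    # standard deviation
```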
5. Center the Data (Mean = 0)
We now shift the activations so that each feature dimension has a mean of 0:
Subtract the mean (often visualized as green in diagrams)
This ensures the 4 activation values in each feature dimension average to zero (see the sketch below)
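In the sketch, centering is one broadcasted subtraction:

```python
# Subtract each row's mean: every feature dimension now averages to ~0
# across the 4 examples.
A_centered = A - mean
print(A_centered.mean(axis=1))  # approximately [0. 0. 0.]
```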
6. Normalize the Variance (Variance = 1)
Next, we scale the activations so that each feature dimension has a variance of 1:
Divide by the standard deviation (orange); in practice a small epsilon is added under the square root for numerical stability
This standardization stabilizes training by putting every feature dimension on a uniform scale (see the sketch below)
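Continuing the sketch; the epsilon value of 1e-5 is a common default and an assumption here, not something taken from the exercise:

```python
# Divide by the standard deviation. The small epsilon guards against
# division by zero when a feature is constant across the batch.
eps = 1e-5
A_hat = A_centered / np.sqrt(var + eps)
print(A_hat.var(axis=1))  # approximately [1. 1. 1.]
```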
7. Learnable Scale & Shift
Finally, Batch Normalization adds flexibility back into the network by allowing it to learn how much to scale and shift the normalized values:
Multiply each normalized feature by a learnable scale (gamma) and add a learnable shift (beta)
In the worked diagram this is written as a single matrix: the diagonal elements are the scales and the last column holds the shifts, and both are trainable parameters
These parameters allow the network to restore or adjust the distribution if needed (see the sketch below)
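Here is the last step of the running sketch. The scale gamma and shift beta start at the usual defaults of 1 and 0; the single-matrix form at the end is just one way to read the diagram's layout (scales on the diagonal, shifts in the last column) and produces the same result.

```python
# Trainable per-feature scale (gamma) and shift (beta), initialized to
# the usual defaults and then updated by gradient descent.
gamma = np.ones((3, 1))
beta  = np.zeros((3, 1))
Y = gamma * A_hat + beta

# Equivalent single-matrix view: scales on the diagonal, shifts in the
# last column, applied to the normalized features with a row of ones appended.
M = np.hstack([np.diag(gamma.ravel()), beta])                 # shape (3, 4)
Y_alt = M @ np.vstack([A_hat, np.ones((1, A_hat.shape[1]))])  # same as Y
```

With gamma = 1 and beta = 0 the output is exactly the normalized activations; during training the network is free to move these parameters away from the defaults, which is what lets it restore the original distribution if that turns out to be useful.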
And that’s it!
Through these 7 steps, Batch Normalization standardizes activations while still giving the network room to adapt. This simple process often leads to faster convergence and better performance in modern deep learning models.
🔥 New Matrix Multiplication Workbook (eBook)
Check it out at https://store.byhand.ai