Batch normalization is a common technique for improving training and achieving faster convergence. It sounds simple, but it is often misunderstood.
🤔 Does batch normalization involve trainable parameters, tunable hyper-parameters, or both?
🤔 Is batch normalization applied to inputs, features, weights, biases, or outputs?
🤔 How is batch normalization different from layer normalization?
This hands-on exercise can help shed some light on these questions.
[1] Given
↳ A mini-batch of 4 training examples, each with 3 features.
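To make the steps concrete, the snippets below sketch the exercise in NumPy. Only the shapes follow the post (3 features × 4 examples, laid out with features as rows and examples as columns); the actual values are made up for illustration.

```python
import numpy as np

# Mini-batch of 4 training examples, each with 3 features.
# Rows = feature dimensions, columns = examples, so X has shape (3, 4).
# These values are hypothetical, not the ones from the original exercise.
X = np.array([
    [1., 0., 2., -1.],
    [0., 1., 1.,  2.],
    [2., 1., 0.,  1.],
])
```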
[2] Linear Layer
↳ Multiply by the weights and add the biases to obtain new features
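Step [2] in code, continuing the sketch above, with a hypothetical weight matrix and bias vector:

```python
# Hypothetical weights (3x3) and biases (3x1) for the linear layer.
W = np.array([
    [1., 0.,  1.],
    [0., 1., -1.],
    [1., 1.,  0.],
])
b = np.array([[1.], [0.], [-1.]])

Z = W @ X + b   # new features, shape (3, 4); Z happens to contain one negative value, -2
```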
[3] ReLU
↳ Apply the ReLU activation function, which suppresses negative values by setting them to zero. In this exercise, -2 is set to 0.
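Step [3], continuing the sketch:

```python
# ReLU: negative entries of Z (here the single -2) are set to 0.
A = np.maximum(Z, 0.0)
```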
[4] Batch Statistics
↳ Compute the sum, mean, variance, and standard deviation across the four examples in this mini-batch.
↳ Note that these statistics are computed for each row (i.e., each feature dimension).
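Step [4], continuing the sketch. A small epsilon is added before the square root, a common implementation detail that the hand exercise omits:

```python
# Per-row (per-feature) statistics, computed across the 4 examples (columns).
total = A.sum(axis=1, keepdims=True)    # shape (3, 1)
mean  = A.mean(axis=1, keepdims=True)   # shape (3, 1)
var   = A.var(axis=1, keepdims=True)    # shape (3, 1), population variance
std   = np.sqrt(var + 1e-5)             # epsilon for numerical stability
```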
[5] Shift to Mean = 0
↳ Subtract the mean (green) from the activation values for each training example
↳ The intended effect is for the 4 activation values in each dimension to average to zero
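Step [5], continuing the sketch:

```python
# Center each feature dimension: every row of A_centered now averages to zero.
A_centered = A - mean
```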
[6] Scale to Variance = 1
↳ Divide by the standard deviation (orange)
↳ The intended effect is for the 4 activation values in each dimension to have variance equal to one.
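Step [6], continuing the sketch:

```python
# Scale each feature dimension: every row of A_hat now has (roughly) unit variance.
A_hat = A_centered / std
```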
[7] Scale & Shift
↳ Multiply the normalized features from [6] by a linear transformation matrix, and pass the results to the next layer
↳ The intended effect is to scale and shift the normalized feature values to a new mean and variance, which are to be learned by the network
↳ The elements on the diagonal (the scale parameters, γ) and in the last column (the shift parameters, β) are trainable parameters the network will learn.
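Step [7], continuing the sketch. Here γ (gamma) and β (beta) are initialized to 1 and 0, a common default rather than something specified in the exercise:

```python
# Trainable scale (the diagonal, gamma) and shift (the last column, beta),
# one value per feature dimension.
gamma = np.ones((3, 1))
beta  = np.zeros((3, 1))

# Equivalent to multiplying [diag(gamma) | beta] with A_hat augmented by a row of ones.
Y = gamma * A_hat + beta    # passed on to the next layer

print(Y.mean(axis=1))   # ~0 per feature while gamma=1, beta=0
print(Y.var(axis=1))    # ~1 per feature while gamma=1, beta=0
```

For comparison, layer normalization would compute the mean and variance along axis=0 instead, i.e., across the 3 features of each individual example, so it does not depend on the other examples in the mini-batch. This also answers the first question above: batch normalization has both trainable parameters (γ and β) and tunable hyper-parameters (e.g., the epsilon and the momentum of the running statistics used at inference time).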