Batch normalization is a common technique for improving training and achieving faster convergence. It sounds simple, but it is often misunderstood.
🤔 Does batch normalization involve trainable parameters, tunable hyper-parameters, or both?
🤔 Is batch normalization applied to inputs, features, weights, biases, or outputs?
🤔 How is batch normalization different from layer normalization?
This hands-on exercise can help shed some light on these questions.
[1] Given
↳ A mini-batch of 4 training examples, each with 3 features.
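To make the steps concrete, the snippets below sketch the exercise in NumPy. Only the shapes follow the post (3 features × 4 examples, laid out with features as rows and examples as columns); the actual values are made up for illustration.

```python
import numpy as np

# Mini-batch of 4 training examples, each with 3 features.
# Rows = feature dimensions, columns = examples, so X has shape (3, 4).
# These values are hypothetical, not the ones from the original exercise.
X = np.array([
    [1., 0., 2., -1.],
    [0., 1., 1.,  2.],
    [2., 1., 0.,  1.],
])
```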
[2] Linear Layer
↳ Multiply by the weights and add the biases to obtain new features
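Step [2] in code, continuing the sketch above, with a hypothetical weight matrix and bias vector:

```python
# Hypothetical weights (3x3) and biases (3x1) for the linear layer.
W = np.array([
    [1., 0.,  1.],
    [0., 1., -1.],
    [1., 1.,  0.],
])
b = np.array([[1.], [0.], [-1.]])

Z = W @ X + b   # new features, shape (3, 4); Z happens to contain one negative value, -2
```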
[3] ReLU
↳ Apply the ReLU activation function, which suppresses negative values by setting them to zero. In this exercise, -2 is set to 0.
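Step [3], continuing the sketch:

```python
# ReLU: negative entries of Z (here the single -2) are set to 0.
A = np.maximum(Z, 0.0)
```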
[4] Batch Statistics
↳ Compute the sum, mean, variance, and standard deviation across the four examples in this mini-batch.
↳ Note that these statistics are computed for each row (i.e., each feature dimension).
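Step [4], continuing the sketch. A small epsilon is added before the square root, a common implementation detail that the hand exercise omits:

```python
# Per-row (per-feature) statistics, computed across the 4 examples (columns).
total = A.sum(axis=1, keepdims=True)    # shape (3, 1)
mean  = A.mean(axis=1, keepdims=True)   # shape (3, 1)
var   = A.var(axis=1, keepdims=True)    # shape (3, 1), population variance
std   = np.sqrt(var + 1e-5)             # epsilon for numerical stability
```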
[5] Shift to Mean = 0
↳ Subtract the mean (green) from the activation values for each training example
↳ The intended effect is for the 4 activation values in each dimension to average to zero
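Step [5], continuing the sketch:

```python
# Center each feature dimension: every row of A_centered now averages to zero.
A_centered = A - mean
```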
[6] Scale to Variance = 1
↳ Divide by the standard deviation (orange)
↳ The intended effect is for the 4 activation values in each dimension to have variance equal to one.
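Step [6], continuing the sketch:

```python
# Scale each feature dimension: every row of A_hat now has (roughly) unit variance.
A_hat = A_centered / std
```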
[7] Scale & Shift
↳ Multiply the normalized features from [6] by a linear transformation matrix, and pass the results to the next layer
↳ The intended effect is to scale and shift the normalized feature values to a new mean and variance, which are to be learned by the network
↳ The elements on the diagonal (the scale parameters, γ) and in the last column (the shift parameters, β) are trainable parameters the network will learn.
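Step [7], continuing the sketch. Here γ (gamma) and β (beta) are initialized to 1 and 0, a common default rather than something specified in the exercise:

```python
# Trainable scale (the diagonal, gamma) and shift (the last column, beta),
# one value per feature dimension.
gamma = np.ones((3, 1))
beta  = np.zeros((3, 1))

# Equivalent to multiplying [diag(gamma) | beta] with A_hat augmented by a row of ones.
Y = gamma * A_hat + beta    # passed on to the next layer

print(Y.mean(axis=1))   # ~0 per feature while gamma=1, beta=0
print(Y.var(axis=1))    # ~1 per feature while gamma=1, beta=0
```

For comparison, layer normalization would compute the mean and variance along axis=0 instead, i.e., across the 3 features of each individual example, so it does not depend on the other examples in the mini-batch. This also answers the first question above: batch normalization has both trainable parameters (γ and β) and tunable hyper-parameters (e.g., the epsilon and the momentum of the running statistics used at inference time).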