[1] Forward Pass
↳ Run a forward pass through a 3-layer multi-layer perceptron (MLP) on an input vector X, producing predictions Y^{Pred} = [0.5, 0.5, 0]; the ground-truth label is Y^{Target} = [0, 1, 0].
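For reference, a minimal NumPy sketch of such a forward pass is below. The layer sizes, random weights, and input values are assumptions for illustration only; they are not the numbers from the hand-drawn worksheet.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

# Assumed shapes: input 4 -> hidden 4 -> hidden 2 -> output 3 (hypothetical)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
W3, b3 = rng.normal(size=(3, 2)), np.zeros(3)

z1 = W1 @ x + b1;  a1 = relu(z1)          # layer 1: linear + ReLU
z2 = W2 @ a1 + b2; a2 = relu(z2)          # layer 2: linear + ReLU
z3 = W3 @ a2 + b3; y_pred = softmax(z3)   # layer 3: linear + softmax
y_target = np.array([0.0, 1.0, 0.0])
loss = -np.sum(y_target * np.log(y_pred + 1e-12))  # cross-entropy loss
```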
[2] Backpropagation
↳ Insert cells to hold our calculations.
[3] Layer 3 - Softmax (blue)
↳ Calculate ∂L / ∂z3 directly using the simple equation: Y^{Pred} - Y^{Target} = [0.5, -0.5, 0].
↳ This simple expression is the benefit of using Softmax and Cross-Entropy Loss together: their derivatives combine into a single subtraction.
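Because the two derivatives cancel into one subtraction, the whole step is a single vector operation. A quick numerical check in plain NumPy, using the values stated above:

```python
import numpy as np

y_pred   = np.array([0.5, 0.5, 0.0])   # softmax output from the forward pass
y_target = np.array([0.0, 1.0, 0.0])   # one-hot ground-truth label

dL_dz3 = y_pred - y_target             # gradient of cross-entropy loss w.r.t. logits z3
print(dL_dz3)                          # [ 0.5 -0.5  0. ]
```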
[4] Layer 3 - Weights (orange) & Biases (black)
↳ Calculate ∂L / ∂W3 and ∂L / ∂b3 by multiplying ∂L / ∂z3 and [ a2 | 1 ].
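In matrix terms this is an outer product: ∂L / ∂z3 times the row vector [ a2 | 1 ], where the a2 part yields ∂L / ∂W3 and the appended 1 yields ∂L / ∂b3. A sketch with a hypothetical a2 (the real values come from the worksheet's forward pass):

```python
import numpy as np

dL_dz3 = np.array([0.5, -0.5, 0.0])    # from the softmax/cross-entropy step
a2     = np.array([1.0, 2.0])          # hypothetical layer-2 activations

dL_dW3 = np.outer(dL_dz3, a2)          # (3, 2) gradient for W3
dL_db3 = dL_dz3                        # the appended "1" column gives the bias gradient
```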
[5] Layer 2 - Activations (green)
↳ Calculate ∂L / ∂a2 by multiplying ∂L / ∂z3 and W3.
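With column-vector conventions this is a multiplication by the transpose of W3, i.e. ∂L / ∂a2 = W3ᵀ · ∂L / ∂z3 (in the worksheet's row-vector layout it reads as ∂L / ∂z3 times W3). The W3 values below are placeholders, not the worksheet's numbers:

```python
import numpy as np

dL_dz3 = np.array([0.5, -0.5, 0.0])    # gradient at the layer-3 pre-activations
W3 = np.array([[ 1.0, 0.0],            # hypothetical 3x2 weight matrix
               [ 0.0, 1.0],
               [-1.0, 1.0]])

dL_da2 = W3.T @ dL_dz3                 # propagate the gradient back to a2
```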
[6] Layer 2 - ReLU (blue)
↳ Calculate ∂L / ∂z2 by multiplying ∂L / ∂a2 elementwise with 1 where z2 is positive and 0 otherwise (the ReLU derivative).
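ReLU passes gradient only where its input was positive, so this step is an elementwise mask. A sketch with hypothetical values:

```python
import numpy as np

dL_da2 = np.array([0.5, -0.5])         # hypothetical upstream gradient
z2     = np.array([1.2, -0.3])         # hypothetical layer-2 pre-activations

dL_dz2 = dL_da2 * (z2 > 0)             # ReLU derivative: 1 where z2 > 0, else 0
```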
[7] Layer 2 - Weights (orange) & Biases (black)
↳ Calculate ∂L / ∂W2 and ∂L / ∂b2 by multiplying ∂L / ∂z2 and [ a1 | 1 ].
[8] Layer 1 - Activations (green)
↳ Calculate ∂L / ∂a1 by multiplying ∂L / ∂z2 and W2.
[9] Layer 1 - ReLU (blue)
↳ Calculate ∂L / ∂z1 by multiplying ∂L / ∂a1 elementwise with 1 where z1 is positive and 0 otherwise (the ReLU derivative).
[10] Layer 1 - Weights (orange) & Biases (black)
↳ Calculate ∂L / ∂W1 and ∂L / ∂b1 by multiplying ∂L / ∂z1 and [ x | 1 ].
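Steps [7] through [10] repeat the same three operations one layer down: an outer product for the weights and biases, a transpose multiplication for the activations, and a ReLU mask for the pre-activations. A combined sketch with hypothetical values standing in for the worksheet's numbers:

```python
import numpy as np

# Hypothetical values; the real ones come from the worksheet's forward pass.
dL_dz2 = np.array([0.5, 0.0])
W2     = np.array([[1.0, -1.0, 0.5, 0.0],
                   [0.0,  2.0, 1.0, 1.0]])   # assumed 2x4 weight matrix
z1     = np.array([0.7, -0.2, 1.5, -1.0])
a1     = np.maximum(z1, 0.0)
x      = np.array([1.0, 0.0, -1.0, 2.0])

dL_dW2 = np.outer(dL_dz2, a1)          # step [7]: layer-2 weight gradient
dL_db2 = dL_dz2                        # step [7]: layer-2 bias gradient
dL_da1 = W2.T @ dL_dz2                 # step [8]: back to layer-1 activations
dL_dz1 = dL_da1 * (z1 > 0)             # step [9]: ReLU mask
dL_dW1 = np.outer(dL_dz1, x)           # step [10]: layer-1 weight gradient
dL_db1 = dL_dz1                        # step [10]: layer-1 bias gradient
```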
[11] Gradient Descent
↳ Update the weights and biases by subtracting their gradients, typically scaled by a learning rate.
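The update itself is plain gradient descent: each parameter moves a small step against its gradient. A minimal sketch, where the learning rate and parameter values are assumptions:

```python
import numpy as np

lr = 0.1                               # assumed learning rate

W3     = np.zeros((3, 2))              # hypothetical current weights
dL_dW3 = np.outer(np.array([0.5, -0.5, 0.0]), np.array([1.0, 2.0]))

W3 -= lr * dL_dW3                      # gradient-descent step; same rule for every W and b
```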
Insights
💡 Matrix Multiplication Is All You Need: Just like the forward pass, backpropagation is all about matrix multiplications. You can do everything by hand, as I demonstrated in this exercise, albeit slowly and imperfectly. This is why the GPU's ability to multiply matrices efficiently plays such an important role in the evolution of deep learning, and why NVIDIA is now valued at close to $1 trillion.
💡 Exploding Gradients: We can already see the gradients getting larger as we backpropagate toward the earlier layers, even in this simple 3-layer network. This motivates methods such as skip connections, used in ResNet, to handle exploding (or vanishing) gradients.
I did the calculations entirely by hand. Please let me know if you spot any error or have any questions!