Dropout is a simple yet effective way of reducing overfitting and improving generalization. This hands-on exercise lets students practice calculating dropout, thereby gaining insight into its inner workings.
As an additional bonus, students get to practice calculating the gradients of the Mean Square Error (MSE) loss. After the practice, students are often surprised by how simple it is.
-- 𝗡𝗲𝘁𝘄𝗼𝗿𝗸 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 --
1. Linear(2,4)
2. ReLU
3. Dropout(0.5)
4. Linear(4,3)
5. ReLU
6. Dropout(0.33)
7. Linear(3,2)
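For readers who want to cross-check the exercise in code, a minimal PyTorch sketch of the same stack could look like this (the name toy_net is just for illustration):

import torch.nn as nn

toy_net = nn.Sequential(
    nn.Linear(2, 4),     # 1. Linear(2,4)
    nn.ReLU(),           # 2. ReLU
    nn.Dropout(p=0.5),   # 3. Dropout(0.5)
    nn.Linear(4, 3),     # 4. Linear(4,3)
    nn.ReLU(),           # 5. ReLU
    nn.Dropout(p=0.33),  # 6. Dropout(0.33)
    nn.Linear(3, 2),     # 7. Linear(3,2)
)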
-- 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵 --
🏋️ Training
[1] Given
↳ A training set of 2 examples X1, X2
[2] 🟧 Random: p > 0.5
↳ Draw 4 random numbers
↳ For each random number, if it is above 0.5 we keep the node and denote it ◯; otherwise we drop it and denote it ╳.
↳ The result is [◯, ╳, ◯, ╳]
[3] 🟧 Dropout: Matrix
↳ Calculate the scaling factor: 1 / (1 - p) = 1 / (1 - 0.5) = 2
↳ Set the diagonal based on [◯, ╳, ◯, ╳], where ◯ = 2 and ╳ = 0
↳ The purpose is to drop the 2nd and the 4th nodes, and scale the remaining two nodes by 2.
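Steps [2] and [3] can be sketched in NumPy like this (the fixed seed and variable names are just for illustration):

import numpy as np

p = 0.5                          # drop probability for this layer
rng = np.random.default_rng(0)   # fixed seed, purely for reproducibility
draws = rng.random(4)            # step [2]: draw 4 random numbers
keep = draws > p                 # keep (◯) where the draw is above p, drop (╳) otherwise
scale = 1.0 / (1.0 - p)          # step [3]: scaling factor = 2
D1 = np.diag(keep * scale)       # diagonal dropout matrix: ◯ -> 2, ╳ -> 0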
[4] 🟦 Random: p > 0.33
↳ Draw 3 random numbers
↳ For each random number, if it is above 0.33 we keep the node and denote it ◯; otherwise we drop it and denote it ╳.
↳ The result is [◯, ◯, ╳]
[5] 🟦 Dropout: Matrix
↳ Calculate the scaling factor: 1 / (1 - p) = 1 / (1 - 0.33) ≈ 1.5
↳ Set the diagonal based on [◯, ◯, ╳], where ◯ = 1.5 and ╳ = 0
↳ The purpose is to drop the 3rd node, and scale the remaining two nodes by 1.5.
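Steps [4] and [5] follow the same recipe, this time with 3 draws and p = 0.33:

p2 = 0.33
draws2 = rng.random(3)                     # step [4]: draw 3 random numbers
keep2 = draws2 > p2                        # keep (◯) above 0.33, drop (╳) otherwise
D2 = np.diag(keep2 * (1.0 / (1.0 - p2)))   # step [5]: ◯ -> ~1.5, ╳ -> 0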
[6] Feed Forward
↳ Now that all the matrices across the layers are ready, perform the feed-forward pass as a series of matrix multiplications from top to bottom.
↳ The ReLU activation function is applied along the way, setting negative feature values to zero (denoted ╳).
↳ The outputs are Y.
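A sketch of this forward pass, treating dropout as the diagonal matrices D1 and D2 built above (the weights W1, W2, W3 and biases b1, b2, b3 are placeholders, not the values in the exercise):

def relu(z):
    return np.maximum(z, 0.0)        # ReLU: negative values (╳) become zero

def forward_train(x, W1, b1, W2, b2, W3, b3, D1, D2):
    h1 = D1 @ relu(W1 @ x + b1)      # Linear(2,4) -> ReLU -> Dropout(0.5)
    h2 = D2 @ relu(W2 @ h1 + b2)     # Linear(4,3) -> ReLU -> Dropout(0.33)
    return W3 @ h2 + b3              # Linear(3,2) -> outputs Y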
[7] 🟥 Loss Gradients of Mean Square Error (MSE)
↳ The per-element gradient is 2 * (Y - Y'), the derivative of the squared error (Y - Y')² with respect to the output Y
↳ First we calculate Outputs (Y) - Targets (Y')
↳ Second we multiply each element by 2
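As a quick numeric check (the values below are illustrative only, not the numbers in the exercise):

Y = np.array([0.9, -0.3])        # example outputs
Y_target = np.array([1.0, 0.0])  # example targets
grad_Y = 2.0 * (Y - Y_target)    # subtract, then multiply by 2 -> [-0.2, -0.6]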
[8] Update Weights
↳ Use the loss gradients to start backpropagation
↳ Update some weights (light red)
↳ The values of the new weights are for demonstration purposes only, not based on a real calculation.
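The update itself is ordinary gradient descent. Continuing the sketch above for the last layer (Y = W3 @ h2 + b3, with h2 the hidden activation from the forward pass and lr a placeholder learning rate):

lr = 0.1                               # learning rate (placeholder value)
W3 = W3 - lr * np.outer(grad_Y, h2)    # dL/dW3 = outer(grad_Y, h2)
b3 = b3 - lr * grad_Y                  # dL/db3 = grad_Y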
🔍 Inference
[9] Deactivate Dropout
↳ We set both dropout matrices to identity matrices
↳ The effect is to keep all the features as they are.
[10] Feed Forward
↳ Run the forward pass to make predictions on unseen data.
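In code, deactivating dropout amounts to replacing D1 and D2 with identity matrices (or, in PyTorch, calling toy_net.eval()); for example, with a placeholder input x_new:

D1_eval = np.eye(4)    # step [9]: identity matrix keeps all 4 features
D2_eval = np.eye(3)    # identity matrix keeps all 3 features
Y_pred = forward_train(x_new, W1, b1, W2, b2, W3, b3, D1_eval, D2_eval)   # step [10]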