The most cited deep learning paper ever (>200K citations), yet largely unknown to the general public, is "Deep Residual Learning for Image Recognition," published by Kaiming He et al. at CVPR 2016.
To put 200K citations in context: CVPR accepts about 2,000 papers a year on average. Even if every single accepted CVPR paper cited this one, it would take 100 years to reach 200K citations.
Wow!
Why is ResNet so important?
Because it found a simple solution to the exploding and vanishing gradient problems of deep neural networks. It made training networks with hundreds, even thousands, of layers possible.
How simple is this solution?
An identity matrix!
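If you prefer code, here is a minimal NumPy sketch of the idea (my own toy example, not code from the paper): the identity matrix passes the input through untouched, and the residual block adds that untouched copy back to what the layers compute.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                      # an input vector
W, b = rng.standard_normal((3, 3)), rng.standard_normal(3)

F = np.maximum(W @ x + b, 0)                    # what the layers learn: the residual F(x)
y = F + np.eye(3) @ x                           # the identity matrix carries x straight through
assert np.allclose(y, F + x)                    # y = F(x) + x
```

Because the identity term adds x unchanged, gradients always have a direct path back to earlier layers, which is what keeps very deep networks trainable.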
-- 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵 --
R͟e͟s͟i͟d͟u͟a͟l͟ ͟B͟l͟o͟c͟k͟
[1] Given
↳ A mini batch of 3 input vectors (3D)
[2] Linear Layer
↳ Multiply the input with weights and bias
↳ Apply ReLU (negatives → 0)
↳ Obtain 3 feature vectors
[3] Concatenate
↳ Stack (horizontally) an identity matrix and the weight and bias matrix of the 2nd layer
↳ Stack (vertically) the input vectors and the feature vectors from the previous layer
↳ Draw lines to visualize the links between rows (weights) and columns (features)
↳ These links are the "Skip Connections"
[4] Linear Layer + Identity
↳ Multiply the two stacked matrices
↳ This is equivalent to F(X) + X
↳ Apply ReLU (negatives → 0)
↳ Pass the results to the next residual block
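Here are steps [1]-[4] as a toy NumPy sketch (my own reconstruction; the 2nd layer's bias is omitted to keep the stacking simple). The assert confirms that one multiplication of the stacked matrices really gives F(X) + X.

```python
import numpy as np

rng = np.random.default_rng(0)

# [1] Mini-batch of 3 input vectors (3-D), one per column
X = rng.standard_normal((3, 3))

# [2] Linear layer + ReLU -> 3 feature vectors
W1, b1 = rng.standard_normal((3, 3)), rng.standard_normal((3, 1))
H = np.maximum(W1 @ X + b1, 0)

# [3] Concatenate: [ I | W2 ] horizontally, [ X ; H ] vertically
W2 = rng.standard_normal((3, 3))      # weights of the 2nd layer (bias omitted here)
top = np.hstack([np.eye(3), W2])
bottom = np.vstack([X, H])

# [4] One multiplication realizes the skip connection: I @ X + W2 @ H = X + F(X)
out = np.maximum(top @ bottom, 0)
assert np.allclose(out, np.maximum(X + W2 @ H, 0))
```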
T͟r͟a͟n͟s͟f͟o͟r͟m͟e͟r͟'͟s͟ ͟E͟n͟c͟o͟d͟e͟r͟ ͟B͟l͟o͟c͟k͟
Let's see how residual blocks (a.k.a. skip connections) work in a Transformer model.
[5] 🟩 Attention
↳ A sequence of 3 input vectors (2D)
↳ Compute an attention matrix (bright yellow)
↳ Multiply the input vectors with the attention matrix to obtain attention-weighted vectors.
↳ Read [Self Attention by hand ✍️ ] https://lnkd.in/gDW8Um4W to learn more.
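Here is step [5] as a toy NumPy sketch (dot-product scores stand in for the full query/key/value machinery):

```python
import numpy as np

def softmax_cols(S):
    e = np.exp(S - S.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

X = np.random.default_rng(1).standard_normal((2, 3))  # 3 input vectors (2-D), one per column

A = softmax_cols(X.T @ X)   # toy attention matrix: each column sums to 1
attn = X @ A                # attention-weighted vectors: each output column mixes the input columns
```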
[6] 🟩 Concatenate
↳ Stack two identity matrices (to achieve 1 + 1)
↳ Stack the input vectors and the attention-weighted vectors
↳ Draw skip connections
[7] 🟩 Add
↳ Multiply the two stacked matrices
[8] 🟪 Feed Forward: First Layer
↳ Multiply the input with weights and bias
↳ Apply ReLU (negatives → 0)
↳ Obtain 3 feature vectors
[9] 🟪 Feed Forward: Concatenate
↳ Stack and visualize like step [3]
[10] 🟪 Feed Forward: Second Layer + Identity
↳ Multiply the two stacked matrices
↳ Apply ReLU (negatives → 0)
↳ Pass the results to the next encoder block
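And here are steps [5]-[10] end to end as a toy NumPy sketch (a fixed averaging matrix stands in for attention, and LayerNorm is left out, just like in the walkthrough):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sequence of 3 input vectors (2-D), one per column
X = rng.standard_normal((2, 3))

# [5] Attention-weighted vectors (toy attention matrix: uniform averaging over positions)
A = np.full((3, 3), 1 / 3)
attn = X @ A

# [6][7] Add: two stacked identities give the "1 + 1" skip connection, i.e. X + attn
Z = np.hstack([np.eye(2), np.eye(2)]) @ np.vstack([X, attn])
assert np.allclose(Z, X + attn)

# [8] Feed forward, first layer + ReLU
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal((4, 1))
H = np.maximum(W1 @ Z + b1, 0)

# [9][10] Second layer + identity, then ReLU: equivalent to relu(Z + W2 @ H)
W2 = rng.standard_normal((2, 4))
out = np.maximum(np.hstack([np.eye(2), W2]) @ np.vstack([Z, H]), 0)
assert np.allclose(out, np.maximum(Z + W2 @ H, 0))
```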
Did you notice that the sizes and positions of the identity matrices are different?
This difference illustrates the magic of the transformer block:
• The attention layer combines positions (across columns)
• The feed forward layer combines features (across rows)
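You can see this difference directly in the shapes of the matrix multiplications (toy numbers):

```python
import numpy as np

X = np.arange(6.0).reshape(2, 3)   # 2 features (rows) x 3 positions (columns)

A = np.full((3, 3), 1 / 3)         # attention: mixes columns (positions)
W = np.ones((2, 2))                # feed forward: mixes rows (features)

print(X @ A)   # every column becomes an average over the positions
print(W @ X)   # every entry becomes a sum over the features at its position
```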
Practice Blank Sheet
Step by Step Guide