Switch Transformer by Hand ✍️
Calculating AI by Hand: 24 of 28
The Switch Transformer (Fedus, Zoph, and Shazeer, 2022) introduced a simple, efficient form of sparse Mixture of Experts (MoE) that scales models to trillions of parameters. It is the basis of the sparse MoE in models like Gemini 1.5.
How does a Switch Transformer work?
Setup
Step 1 of 13: Given
Input features (X1-X5) from the previous block
Attention
Step 2 of 13: Attention Matrix
Feed all 5 features to a query-key attention module (QK) to obtain an attention weight matrix (A).
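As a rough illustration of Step 2, here is a minimal NumPy sketch of a query-key module. It assumes scaled dot-product scores followed by a softmax, with features stored as columns. The projection matrices `W_q` and `W_k`, the dimensions, and the random values are all hypothetical; the by-hand exercise may use simpler numbers or a different normalization.

```python
import numpy as np

def attention_matrix(X, W_q, W_k):
    """Return a 5x5 attention weight matrix A for features stored as columns of X.

    Assumption: scaled dot-product scores with a column-wise softmax,
    so column j of A holds the weights used to build output feature j.
    """
    Q = W_q @ X                                  # queries, shape (d_k, n)
    K = W_k @ X                                  # keys, shape (d_k, n)
    scores = K.T @ Q / np.sqrt(Q.shape[0])       # scores[i, j]: key i vs. query j
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=0, keepdims=True)  # each column sums to 1

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))      # 5 features (X1-X5), each 3-dimensional
W_q = rng.normal(size=(2, 3))    # hypothetical query projection
W_k = rng.normal(size=(2, 3))    # hypothetical key projection
A = attention_matrix(X, W_q, W_k)
print(A.shape)
```

Each column of `A` is a set of non-negative weights over the 5 positions, which is what the pooling step consumes.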
Step 3 of 13: Pooling
Multiply the input features by the attention weight matrix to obtain attention-weighted features (Z1-Z5).
The effect is to combine features across positions (horizontally).
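The pooling step can be sketched as a single matrix product. This is only an illustration under stated assumptions: features are stored as columns (so X is 3x5 here), and the attention weight matrix `A` is a made-up example whose columns each sum to 1. The numbers are not taken from the exercise.

```python
import numpy as np

# Features X1-X5 as columns; each feature has 3 dimensions (illustrative values).
X = np.array([[1., 0., 2., 0., 1.],
              [0., 1., 0., 2., 1.],
              [1., 1., 0., 0., 2.]])

# Hypothetical attention weights: half self-attention, half uniform over all positions.
# Every column sums to 0.5 + 5 * 0.1 = 1.
A = 0.5 * np.eye(5) + 0.1 * np.ones((5, 5))

# Pooling: column j of Z is a weighted sum of the columns of X,
# using column j of A as the weights.
Z = X @ A

# The same result for Z1, written out position by position.
Z1 = sum(A[i, 0] * X[:, i] for i in range(5))
print(np.allclose(Z[:, 0], Z1))
```

Because the mixing happens across columns, each row (feature dimension) is blended independently across the 5 positions, which is the "horizontal" combination described above.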





