Today is our university's "Reading Day": no classes, no assignments due. Students in my Computer Vision and Generative AI courses are all supposed to be studying for their final exams.
Lately I've been getting questions about MoE models from my students and LinkedIn followers (David Gong).
Why are people interested in MoE Models?
On 12/11, Mistral AI released its 8x7B MoE model, 8 times bigger than its earlier 7B model. It also closed a $415 million Series A. Several people reported this (Sophia Yang, Ph.D., Lewis Tunstall, Marko Vidrih).
How does an MoE model work?
Here is my hands-on exercise to teach my students the basics of MoE models.
𝗦𝘁𝗲𝗽-𝗯𝘆-𝗦𝘁𝗲𝗽 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵
1. [Inputs] The MoE block received two tokens (blue, orange).
2. Gate Network processed X1 (blue) and determined Expert 2 should be activated.
3. Expert 2 processed X1 (blue).
4. Gate Network processed X2 (orange) and determined Expert 1 should be activated.
5. Expert 1 processed X2 (orange).
6. The ReLU activation function processed the expert outputs and produced the final output (a minimal code sketch of this flow follows below).
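Here is a minimal NumPy sketch of the same flow. I assume 4-dimensional tokens so that each expert is a single 4x4 linear layer (16 weights, matching the count in the next section); the dimensions, weight shapes, and function names are my own illustrative choices, not any particular library's API, and since the weights are random, the chosen experts may not match the exact blue/orange assignment above.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 4        # token dimension (assumed: 4, so each expert holds 4x4 = 16 weights)
num_experts = 2

# Gate network: one linear layer that scores every expert for each token
W_gate = rng.normal(size=(d_model, num_experts))

# Experts: each is a small linear layer with a 4x4 weight matrix (16 parameters)
W_experts = rng.normal(size=(num_experts, d_model, d_model))

def relu(x):
    return np.maximum(x, 0.0)

def moe_block(tokens):
    """Top-1 routing: each token is processed by exactly one expert, then ReLU."""
    scores = tokens @ W_gate                 # (num_tokens, num_experts) gate scores
    chosen = np.argmax(scores, axis=-1)      # index of the single activated expert per token
    outputs = np.zeros_like(tokens)
    for i, (x, e) in enumerate(zip(tokens, chosen)):
        outputs[i] = relu(x @ W_experts[e])  # only the chosen expert actually computes
    return outputs, chosen

# Two input tokens: X1 ("blue") and X2 ("orange")
X = rng.normal(size=(2, d_model))
Y, routed_to = moe_block(X)
print("expert index chosen per token:", routed_to)  # 0-indexed here; 1-indexed in the steps above
print("MoE block output:\n", Y)
```

Real MoE layers (e.g., Mixtral, Switch Transformer) typically also scale each expert's output by the gate's softmax probability before combining; I omit that here to keep the exercise minimal.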
𝗞𝗲𝘆 𝗣𝗿𝗼𝗽𝗲𝗿𝘁𝗶𝗲𝘀
💡 𝗦𝗶𝘇𝗲: The model can get really large simply by adding more experts. In this example, adding one more expert means adding 16 more weight parameters.
💡 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: The gate network selects only a subset of experts to actually compute; in this simple exercise, just one expert per token. In other words, only 50% of the expert parameters are involved in processing each token.
Taking these two properties together, we can see that a sparse MoE can grow very large without sacrificing efficiency.
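A quick sketch of that scaling, continuing the toy setup above (4-dimensional tokens, counting expert weights only; the small gate network is ignored here, as in the 50% figure):

```python
# Total expert parameters grow with the number of experts,
# while the parameters active per token (top-1 routing) stay fixed.
d_model = 4                              # assumed 4-dim tokens, as in the sketch above
params_per_expert = d_model * d_model    # 4 x 4 = 16 weights per expert

for num_experts in (2, 3, 8):
    total_expert_params = num_experts * params_per_expert
    active_expert_params = 1 * params_per_expert         # top-1: only one expert runs
    share = active_expert_params / total_expert_params
    print(f"{num_experts} experts: {total_expert_params} expert weights total, "
          f"{active_expert_params} active per token ({share:.0%})")
```

With 2 experts this reproduces the 50% figure above; adding more experts grows the total while the per-token compute stays at one expert.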
I hope you enjoy this hands-on exercise.
What should I share next?
• More advanced MoE exercises?
• Connect this exercise to math?
• Transformer?
• CLIP?
• Mamba?
• Diffusion?
• Or what else?
Please feel free to leave a comment to let me know!