Live Version:
Baseline Version
I will use a Vision Transformer (ViT) as the baseline and extend it live, step by step, to Llama 1, 2, 3, and 4.
Baseline: ViT
+ RMSNorm
+ larger model dimensions
+ more layers
+ RoPE
+ Grouped-Query Attention
+ Sparse Attention
+ Flash Attention
+ longer context length
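
Three of the steps listed above (RMSNorm, RoPE, and the head grouping behind Grouped-Query Attention) are small enough to sketch in plain Python. This is a minimal illustration, not the code from the live session: the function names `rms_norm`, `rope`, and `kv_head_for_query_head` are hypothetical, the vectors are plain lists rather than tensors, and the RoPE variant shown rotates adjacent pairs (implementations differ on how they pair dimensions).

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope(x, position, base=10000.0):
    """Rotary position embedding for one head vector of even length.

    Assumption: adjacent pairs (x[2i], x[2i+1]) are rotated by the angle
    position * base ** (-2i / d). Other pairing conventions exist.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Grouped-query attention: index of the KV head shared by a query head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


if __name__ == "__main__":
    # Normalize a toy activation vector with a unit gain.
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
    # Rotate a toy query vector as if it sat at position 7.
    print(rope([1.0, 0.0, 0.5, 0.5], position=7))
    # With 8 query heads and 2 KV heads, each KV head serves 4 query heads.
    print([kv_head_for_query_head(h, 8, 2) for h in range(8)])
    # → [0, 0, 0, 0, 1, 1, 1, 1]
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on the distance between their positions, which is the property that makes RoPE a relative position encoding.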