GELU
Activation series · 8 of 12
GELU (Gaussian Error Linear Unit) is SiLU's more decisive sibling: it keeps the same `x · gate` structure, but the gate is the standard Gaussian CDF Φ(x) instead of the sigmoid σ(x). GELU is the activation used in BERT, GPT-2/3, and ViT; the original T5 used ReLU, and T5 v1.1 adopted GEGLU, a GELU-gated variant.
Φ has a clean approximation, Φ(x) ≈ σ(1.702x), so we can compare the two activations through a single shared lens: both run a sigmoid gate, and only the input feeding it differs (x for SiLU, 1.702x for GELU).
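To get a feel for how close that approximation is, here is a minimal sketch that compares the exact Gaussian CDF with σ(1.702x) on a grid. The helper names `sigmoid` and `gaussian_cdf` and the range [-6, 6] are choices made for this illustration, not part of the original article.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def gaussian_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Largest gap between the exact gate and its sigmoid stand-in on [-6, 6]
# (the grid range and step are assumptions for this sketch).
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gaussian_cdf(x) - sigmoid(1.702 * x)) for x in grid)
print(f"max |Phi(x) - sigmoid(1.702x)| on [-6, 6]: {max_gap:.4f}")
```

The gap should come out on the order of 0.01, which is why the sigmoid form is a reasonable lens for comparison even though it is not identical to the exact gate.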
The Fate of Five Boba Shops (5 of 5)
The court evolves once more. The grading scale widens from SiLU's 7 points to GELU's 11, two extra notches at each end. SiLU's scale runs from -3 (very unprofitable) to +3 (very profitable). GELU's runs from -5 (hemorrhaging) to +5 (booming). The same shop's books earn a more extreme rating on GELU's wider scale, so the sigmoid commits faster. The 1.702 multiplier in σ(1.702x) plays the role of the conversion factor between the two scales; in the analogy the ratio is 5/3 ≈ 1.67, close to 1.702.
GELU passes through more of a clear profit and wipes out more of a clear debt. Around break-even the two judges are nearly identical; the action is in the tails.
Walking through the Math
1. Profit: each shop's rating x (a 7-point input).
2. Scale: multiply by 1.702 to convert into GELU's 11-point view.
3. Gate: apply sigmoid to the scaled value, σ(1.702x).
4. Output: multiply the gate by the original x, giving GELU(x) ≈ x · σ(1.702x).
Profitable shops keep nearly everything; deep debts get the gate slammed shut and vanish off the books. Like its predecessors, GELU is applied element-wise.
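The four steps above can be sketched in a few lines of plain Python. The function name `gelu_sigmoid_approx` and the five sample ratings are hypothetical, chosen only to illustrate the walk-through; this is the sigmoid approximation, not the exact Φ-based GELU.

```python
import math


def gelu_sigmoid_approx(x: float) -> float:
    """GELU via the sigmoid approximation: x * sigmoid(1.702 * x)."""
    scaled = 1.702 * x                      # step 2: convert to GELU's wider scale
    gate = 1.0 / (1.0 + math.exp(-scaled))  # step 3: sigmoid gate
    return x * gate                         # step 4: gate times the original x


# Step 1: hypothetical ratings for five shops on the 7-point scale.
ratings = [-3.0, -1.0, 0.0, 1.0, 3.0]

# Element-wise application: each rating is processed independently.
outputs = [gelu_sigmoid_approx(x) for x in ratings]

for x, y in zip(ratings, outputs):
    print(f"rating {x:+.1f} -> output {y:+.4f}")
```

With these inputs, the +3 shop keeps almost all of its rating, while the -3 shop's debt shrinks to a small negative value.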
Reading the Numbers
How does the decisive Gaussian judge rule?
Compare with SiLU: at x = -10, SiLU still carries about -0.0005 of debt, while GELU's output is indistinguishable from zero. The Gaussian judge doesn't hedge.
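The comparison can be reproduced with a short sketch. It uses the exact Gaussian gate, written with `math.erfc` so the far negative tail does not lose precision; the function names and the list of sample inputs are assumptions made for this example.

```python
import math


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi(x) = 0.5 * erfc(-x / sqrt(2))."""
    return x * 0.5 * math.erfc(-x / math.sqrt(2.0))


# Hypothetical sample inputs spanning both tails and the break-even point.
samples = [-10.0, -3.0, -1.0, 0.0, 1.0, 3.0, 10.0]

print(f"{'x':>6} {'SiLU':>12} {'GELU':>12}")
for x in samples:
    print(f"{x:>6.1f} {silu(x):>12.6f} {gelu(x):>12.6f}")
```

At x = -10 the SiLU column still shows a small negative residue, while the GELU column rounds to zero at this print precision; in the positive tail GELU sits closer to x than SiLU does.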




