From Thresholds to Probabilities
Jun 25, 2025 · 4 min read

In the previous post, we looked at softmax and NLL loss, both critical for output interpretation and learning in Transformers. Now let's dive into what happens within the network: activation functions, specifically GELU.

What is GELU? GELU, or the Gaussian Error Linear Unit, ...
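As a preview of where this is going, the exact GELU can be sketched in a few lines: it scales each input x by the standard normal CDF Φ(x), which is computable via the error function. This is a minimal standalone sketch, not the implementation any particular framework uses.

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For large positive inputs GELU approaches the identity, for large negative inputs it approaches zero, and near zero it bends smoothly between the two, unlike ReLU's hard kink.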
