From Thresholds to Probabilities
In the previous post, we looked at Softmax and NLL loss, both central to how Transformers interpret their outputs and learn from them. Now let’s dive into what happens inside the network: activation functions, and specifically GeLU.
What is GeLU?
GeLU, or, Gau...
gradientlore.hashnode.dev, 4 min read