In the previous post, we looked at Softmax and NLL loss, both critical for output interpretation and learning in Transformers. Now let’s dive into what happens within the network: activation functions. Specifically, GELU. What is GELU? GELU, or Gaussian Error Linear Unit, is the activation function used in the feed-forward layers of models like BERT and GPT-2.
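Here is a minimal sketch in plain Python of the two common forms, assuming the standard definition GELU(x) = x · Φ(x), where Φ is the standard normal CDF, plus the widely used tanh approximation; the function names are illustrative, not from any particular library:

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation, common in Transformer implementations:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

if __name__ == "__main__":
    # Compare the two forms at a few points; they agree closely.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  "
              f"tanh={gelu_tanh(x):+.6f}")
```

The tanh variant avoids the error function and was historically faster on some hardware, which is why many frameworks expose both forms.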