Completing the Transformer Encoder: Add & Norm and the Feed-Forward Layer
If you’ve been following my PyTorch "learning in public" series, we’ve already tackled the hardest part of the Transformer architecture: Multi-Head Attention (MHA). We took our tokens, split them into per-head query, key, and value projections, and let each head attend over the sequence in parallel. What remains to complete the encoder block are the two pieces in this post's title: the Add & Norm step (a residual connection followed by layer normalization) and the position-wise feed-forward layer.
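To make those two remaining pieces concrete, here is a minimal sketch of the sublayer that follows attention: a position-wise feed-forward network wrapped in Add & Norm, using the post-LN layout from the original Transformer paper. The class name `EncoderFFNBlock` and the default sizes (`d_model=512`, `d_ff=2048`) are my own illustrative choices, not code from this series.

```python
import torch
import torch.nn as nn

class EncoderFFNBlock(nn.Module):
    """Position-wise feed-forward sublayer wrapped in Add & Norm (post-LN).

    Illustrative sketch; names and defaults are assumptions, not the
    series' actual code.
    """
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # Two linear layers with a ReLU in between, applied independently
        # at every position in the sequence.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Add": residual connection around the sublayer.
        # "Norm": layer normalization over the feature dimension.
        return self.norm(x + self.dropout(self.ffn(x)))

x = torch.randn(2, 10, 512)        # (batch, seq_len, d_model)
out = EncoderFFNBlock().eval()(x)  # eval() disables dropout for a deterministic check
print(out.shape)                   # output keeps the input's shape
```

Note that the FFN never mixes information across positions; only attention does that. The residual path is also why the sublayer's output dimension must match `d_model`.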
shalem-raju.hashnode.dev · 7 min read