Grokking of Transformer on Modular Addition
Feb 8 · 3 min read

Background

Grokking is a phenomenon where a model quickly achieves near-perfect training accuracy (memorization), while validation accuracy remains near chance for a long time, and then later transitions sharply to strong generalization after extende...
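
To make "near chance" concrete for the modular addition task named in the title, here is a minimal sketch of how such a dataset might be built and split. The modulus `p = 97`, the 30% training fraction, the seed, and the helper names are all illustrative assumptions, not values or code taken from the post.

```python
import random


def make_modular_addition_data(p=97, train_fraction=0.3, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p, then split.

    p, train_fraction and seed are hypothetical defaults chosen for
    illustration; the post's actual settings are not known here.
    """
    # Full table of p * p examples: ((a, b), (a + b) mod p).
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)

    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


def chance_accuracy(p):
    """Accuracy of a uniform random guess over the p possible residues."""
    return 1.0 / p


train, val = make_modular_addition_data()
print(len(train), len(val), chance_accuracy(97))
```

Under these assumptions, a model that has only memorized the training pairs would score close to `1 / p` (about 1%) on the held-out pairs, which is the plateau that precedes the sharp transition described above.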