Deeper Transformers Are Forgetting What They Learned. MoDA Is the Fix.
Papers I'm Reading — Issue #03
Paper: Mixture-of-Depths Attention (MoDA)
arXiv: 2603.15619 | cs.LG
Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao et al. — Huazhong University of Science & Technology