Why Judge Calibration Matters: Sonnet vs Opus — a Case Study
I ran an experiment comparing two automatic judges I rely on for model evaluation: Sonnet and Opus. I needed a quick, repeatable way to rank generated outputs. I learned a few things the ha...
mickeynovels.hashnode.dev · 4 min read