Golden Sets, LLM-as-Judge, Human Review: Which Grading Style to Reach For
You have an eval set. You know your rubric. You want a pass rate. Now a new question appears: who decides whether each individual output is a pass? You, reading them one by one? A rule engine that checks for specific strings? Another model that grade...
ai-zero-to-hero.hashnode.dev18 min read