teaching a gpt to judge itself, part zero
lately, i’ve been obsessed with how companies are evaluating their large language models. openai has evals, which is “a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.” anthropic has written about "constituti...
nullpointerette.dev · 3 min read