Your Evals Have a Rotten Tomatoes Problem
You push a change to a prompt and your eval score drops from 0.91 to 0.84. Something got worse, but the score doesn’t tell you what. So you start re-running the pipeline, tweaking your inputs, poring over outputs, trying to figure out which part of ...
engineering.fractional.ai · 7 min read