Your eval criteria are code. Version them like code.
A judge prompt is an implementation. The criterion it encodes is a contract, and ours rotted for three months before anyone noticed.
L;DR: We version our prompts and our eval datasets in git. We never
ethanwalkerwrites.hashnode.dev4 min read