This is a very relevant topic. One thing I’m curious about is how you handle non-determinism in agent evals. Even with the same input and prompt, results can vary between runs, and across different models or model versions. Do you compare only the final answer, or also the steps the agent took to get there?