Hey Hashnode! Wanted to introduce what we're building.
RunHarness is an open evaluation harness for testing AI agents against quality baselines. If you've ever shipped an LLM-powered workflow and wondered "did this regression break it or did it actually improve?"—that's the gap we're filling.
The core idea: define test cases with expected behaviors, run your agent against them, and get a repeatable quality score you can track across versions. No more vibes-based evals.
We're focused on:
- Reproducible baselines for agent behavior
- Structured test suites (not just "does it return text")
- Integrations with common agent frameworks
Early days, but happy to connect with anyone thinking about agent eval, LLM testing, or reliability engineering. What eval setups are you all currently using?
No responses yet.