@dev_marcus
Full-stack engineer. Building with React and Go.
The part that trips up most teams I talk to is the eval-style testing section. They hear 'test the AI feature' and reach for the same assert-equals pattern they use everywhere else, then mark every failure as flaky. The mental shift is treating it like grading an essay, not checking a math answer. You are scoring properties against a rubric, not matching a string. Start with one feature, one golden set of maybe 30 real examples, and a faithfulness check. That alone catches more than most full suites do today.
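To make the rubric idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Example` type, the check names, and the thresholds are made up, and the word-overlap `faithfulness` function is a crude stand-in for whatever faithfulness check you actually use (an entailment model, an LLM judge, human labels). The point is only the shape: each check scores a property, and the report is a pass rate per property across the golden set rather than a string match.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Example:
    source: str  # the text the feature was given
    output: str  # what the AI feature produced


def faithfulness(ex: Example) -> float:
    """Crude proxy: share of output words that also appear in the source."""
    source_words = set(ex.source.lower().split())
    output_words = ex.output.lower().split()
    if not output_words:
        return 0.0
    return sum(w in source_words for w in output_words) / len(output_words)


def concise(ex: Example) -> float:
    """1.0 if the output is no longer than the source, else 0.0."""
    return 1.0 if len(ex.output) <= len(ex.source) else 0.0


# Hypothetical rubric: property name -> (scoring function, minimum score to pass).
RUBRIC: Dict[str, Tuple[Callable[[Example], float], float]] = {
    "faithfulness": (faithfulness, 0.8),
    "concise": (concise, 1.0),
}


def grade(golden_set: List[Example]) -> Dict[str, float]:
    """Return the pass rate for each rubric property across the golden set."""
    report = {}
    for name, (check, threshold) in RUBRIC.items():
        passed = sum(check(ex) >= threshold for ex in golden_set)
        report[name] = passed / len(golden_set)
    return report


golden_set = [
    Example("the invoice total is 40 dollars", "the total is 40 dollars"),
    Example("the invoice total is 40 dollars", "the customer owes 900 euros"),
]
report = grade(golden_set)
```

In CI you would fail the build when a pass rate drops below an agreed floor, instead of failing on any single example. That is what stops every miss from being written off as flaky.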
The weight_lbs to weight_kg example is a good one. I've seen exactly that kind of silent unit change cause billing discrepancies that took weeks to trace. Worth noting that even with detection in place, the fix is usually the painful part. You still need fallback parsing logic or an adapter layer per vendor, and those accumulate fast once you're past 10 integrations.
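For anyone who hasn't built one of these, a rough sketch of the fallback parsing in Python. Only the `weight_lbs` and `weight_kg` field names come from the example. The function names, the payload shape, and the parser table are hypothetical, and a real adapter layer would be keyed per vendor and cover far more than one field.

```python
LBS_TO_KG = 0.45359237  # exact by definition of the international pound


def _from_kg(payload: dict) -> float:
    return float(payload["weight_kg"])


def _from_lbs(payload: dict) -> float:
    return float(payload["weight_lbs"]) * LBS_TO_KG


# Hypothetical parser table, tried in order: current field first,
# then the legacy field some vendors still send.
FIELD_PARSERS = {
    "weight_kg": _from_kg,
    "weight_lbs": _from_lbs,
}


def normalize_weight_kg(payload: dict) -> float:
    """Return the weight in kilograms, whichever known field the vendor sent."""
    for field, parse in FIELD_PARSERS.items():
        if field in payload:
            return parse(payload)
    # Fail loudly: a silent default is how billing discrepancies start.
    raise ValueError(f"no known weight field in payload: {sorted(payload)}")
```

Raising on unknown payloads is a deliberate choice here, since a new vendor format should surface as an error rather than as a wrong number downstream.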