The part that trips up most teams I talk to is the eval-style testing section. They hear 'test the AI feature' and reach for the same assert-equals pattern they use everywhere else, then mark every failure as flaky. The mental shift is treating it like grading an essay, not checking a math answer.
You are scoring properties against a rubric, not matching a string. Start with one feature, one golden set of around 30 real examples, and a faithfulness check, meaning a check that the output only claims what the source material supports. That alone catches more than most full suites do today.
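
To make the rubric idea concrete, here is a minimal sketch of what that first suite could look like. Everything named here (`Example`, `RUBRIC`, `numbers_are_grounded`, `is_concise`, `score`, the 40-word limit, the two invoice examples) is made up for illustration. The faithfulness check in particular is a deliberately crude stand-in: it only verifies that every number in the output also appears in the source, whereas a real check would more likely use a model-graded judge or an entailment model.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    source: str  # the material the feature was given
    output: str  # what the feature produced


NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_are_grounded(ex: Example) -> bool:
    """Crude faithfulness proxy (an assumption, not a real faithfulness metric):
    every number in the output must also appear in the source."""
    return set(NUMBER.findall(ex.output)) <= set(NUMBER.findall(ex.source))


def is_concise(ex: Example) -> bool:
    """Hypothetical style property: output stays within 40 words."""
    return len(ex.output.split()) <= 40


# The rubric: named properties, each graded independently per example.
RUBRIC: dict[str, Callable[[Example], bool]] = {
    "faithful": numbers_are_grounded,
    "concise": is_concise,
}


def score(golden_set: list[Example]) -> dict[str, float]:
    """Return the pass rate for each rubric property across the golden set."""
    return {
        name: sum(check(ex) for ex in golden_set) / len(golden_set)
        for name, check in RUBRIC.items()
    }


# Two toy examples standing in for the ~30 real ones.
golden_set = [
    Example("Invoice total is 120.50 due on day 14.",
            "You owe 120.50, due on day 14."),
    Example("Invoice total is 80 due on day 7.",
            "You owe 95, due on day 7."),  # 95 is not in the source
]

rates = score(golden_set)
print(rates)
```

The point of the shape is that nothing is compared to an expected string. Each property gets a pass rate over the golden set, and the suite would gate on thresholds you choose (say, faithfulness must not drop below last week's rate) rather than on exact equality, which is what stops every wording change from looking like a flaky test.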