This matches my experience more than most automation writing. The line that lands hardest is "did the product break, or did the test break" - on a backend serving seven teams, as the only QA, that investigation was eating more of my week than writing the tests ever did.
The flaky-as-trust-risk framing is right. Once people start re-running a red build by reflex, the suite has already lost. What helped me was refusing to label anything "flaky" as a category and instead forcing a class on every failure: timing, selector drift, data state, environment, or a real regression. A bucket you cannot name is usually one you have stopped reading.
On AI as a junior QA that still needs review, agreed, and the review cost is real and recurring. A generated test that nobody on the team can debug six weeks later is not coverage, it is debt with a green checkmark.
Question on your maintenance model: when a test survives a UI change without edits, how do you confirm it still asserts the original intent and did not quietly degrade into a weaker check that just happens to pass?