This is an important distinction, and I think it will become a standard part of AI-assisted security workflows. A model that identifies a vulnerability and a model that recognizes a previously disclosed one can produce nearly identical outputs, but the two represent very different levels of value. The anonymization/redaction test is especially interesting because it asks whether the finding survives without the contextual clues that could trigger recall. More broadly, treating AI findings as leads that require verification, not as conclusions, is very similar to how mature teams already handle static analysis and other automated security tooling. Confidence is not evidence; reproducibility is.
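
To make the redaction idea concrete, here is a rough Python sketch of how such a check might look. Everything in it is an assumption for illustration: `redact`, `survives_redaction`, the sample snippet, the fictional CVE ID, and the two stub analyzers are hypothetical names, and a real setup would pass in whatever function wraps the actual model call.

```python
import re
from typing import Callable, Dict

# Matches CVE-style identifiers such as CVE-2099-0001 (a fictional ID used below).
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def redact(source: str, identifiers: Dict[str, str]) -> str:
    """Replace CVE IDs and known project-specific names with neutral placeholders."""
    redacted = CVE_PATTERN.sub("REDACTED-ID", source)
    # Replace longer names first so "libfoo_parse" is handled before "libfoo".
    for name in sorted(identifiers, key=len, reverse=True):
        placeholder = identifiers[name]
        redacted = re.sub(rf"\b{re.escape(name)}\b", lambda _m: placeholder, redacted)
    return redacted


def survives_redaction(
    source: str,
    identifiers: Dict[str, str],
    analyze: Callable[[str], bool],
) -> bool:
    """True only if the analyzer flags both the original and the redacted source."""
    return analyze(source) and analyze(redact(source, identifiers))


# Hypothetical sample: a snippet with recognizable names and an unsafe copy.
sample = (
    "/* libfoo 1.2, see CVE-2099-0001 */\n"
    "void libfoo_parse(char *dst, const char *src) { strcpy(dst, src); }"
)
identifiers = {"libfoo_parse": "func_a", "libfoo": "project_a"}


# Stub analyzers standing in for a model (assumptions, not real model behavior).
def recall_based(text: str) -> bool:
    # Flags the code only when it recognizes the project name.
    return "libfoo" in text


def pattern_based(text: str) -> bool:
    # Flags the code based on what it actually does.
    return "strcpy(" in text


recall_survives = survives_redaction(sample, identifiers, recall_based)
pattern_survives = survives_redaction(sample, identifiers, pattern_based)
```

With these stubs, the recall-style analyzer stops flagging the code once the names are gone, while the pattern-style one still does. A finding that survives redaction is still only a lead; it would need to be reproduced before being treated as confirmed.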