Great question, and you're right that it's the hardest edge case.
This article covers the extraction side (structured citation data out of inconsistent formats). The detection/verification side is a separate system I've been building on top of it. Short version: I use a regex-first, LLM-second hybrid for determining whether a specific source was actually cited. Regex catches ~73% of verbatim citations with <2% false positives, and an LLM classifier handles the paraphrased references regex misses. The hybrid gets to ~96% true positive rate at ~4% false positive.
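To make the regex-first, LLM-second shape concrete, here's a rough sketch. The patterns and the `llm_classify` stub are illustrative placeholders, not the real system; in production the LLM call would go to an actual classifier rather than a string check.

```python
import re

# Toy patterns for verbatim citation forms; the real set is larger and tuned.
CITATION_PATTERNS = [
    re.compile(r"\[(\d+)\]"),                               # [12]-style numeric citations
    re.compile(r"\(([A-Z][A-Za-z-]+ et al\.?,? \d{4})\)"),  # (Smith et al., 2025)
    re.compile(r"https?://[^\s)\]]+"),                      # bare URLs
]

def llm_classify(text: str, source: str) -> bool:
    """Placeholder for the LLM call that judges paraphrased references."""
    return source.lower() in text.lower()  # stand-in logic for this sketch only

def was_cited(text: str, source: str) -> bool:
    # Pass 1: cheap regex check catches verbatim citations.
    for pat in CITATION_PATTERNS:
        for match in pat.finditer(text):
            if source.lower() in match.group(0).lower():
                return True
    # Pass 2: fall back to the LLM only when regex finds nothing,
    # which is what keeps cost down while recovering paraphrased mentions.
    return llm_classify(text, source)
```

The ordering is the point: regex handles the high-precision bulk cheaply, and the LLM only sees the residue.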
On fabricated citations specifically, the research is sobering. Maheshwari et al. (2025) at Amazon found that generative search engines only achieve about 74% citation accuracy. The Princeton ALCE benchmark found that even the best models lack complete citation support 50% of the time. So the problem isn't hypothetical, it's baseline behavior.
For monitoring, the known-domain matching in Pass 1 sidesteps hallucination by design since you're matching against entities you know are real. Where it bites is in full extraction (Pass 2), where a model can generate plausible URLs to pages that don't exist. I've seen ChatGPT do this with correct domains but fabricated paths.
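The "by design" part is easy to see in a sketch. Matching only against an allowlist of domains you already know exist means a hallucinated domain can never produce a hit (the `KNOWN_DOMAINS` set here is a made-up example):

```python
from urllib.parse import urlparse

# Example allowlist; in practice this is the set of entities being monitored.
KNOWN_DOMAINS = {"nytimes.com", "arxiv.org", "example.com"}

def known_domain_hits(urls):
    hits = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Strip a leading "www." so www.arxiv.org matches arxiv.org.
        host = host[4:] if host.startswith("www.") else host
        if host in KNOWN_DOMAINS:
            hits.append(url)
    return hits
```

Note this is exactly why the correct-domain/fabricated-path case slips through: the host checks out, so only a verification layer can catch the dead path.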
A proper verification layer (HEAD requests to confirm URLs resolve, plus content matching to confirm the source actually supports the claim) is the logical next step, but it's a separate system I haven't shipped yet. That's the gap between "citation extraction" and "citation trust," and I think most people in this space are underestimating it.
Planning to write up the detection/verification layer in more detail soon. Appreciate you flagging the exact problem that makes it necessary.