This is the RAG architecture piece I wish existed two years ago.
The degradation principle stands out: Simple RAG is not a last-resort fallback, it is a legitimate operating mode. Most teams treat fallback as failure when it is actually capacity management.
Three observations from production RAG systems:
1. The cascading failure scenario is not theoretical. A guardrails false-positive cascade triggers regeneration loops, latency spikes, and users abandon. Your observability stack shows success rates while users experience failure. The circuit breaker pattern you mention is essential, not optional.
2. Index desynchronization is silent degradation. RRF fusion operates on incomplete data and search quality degrades invisibly. The monitoring signal is not error rate, it is index size delta. If dense and sparse counts diverge, retrieval quality is already compromised.
3. The 2-3 second P95 target assumes cache hits. Semantic cache is the only thing keeping Advanced RAG economically viable at scale. Miss rate is the real cost driver, not model selection.
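To make point 1 concrete, here is a minimal sketch of a circuit breaker that demotes to Simple RAG as an operating mode rather than an error path. The names `advanced`, `fallback`, and the thresholds are illustrative assumptions, not from the article:

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures of the advanced
    pipeline, then routes to the fallback until `cooldown` elapses.
    Degraded mode is a legitimate state, not an exception."""

    def __init__(self, max_failures=3, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, advanced, fallback, query):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback(query)  # open: serve Simple RAG directly
            self.opened_at = None       # half-open: retry advanced once
            self.failures = 0
        try:
            result = advanced(query)
            self.failures = 0           # success resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback(query)      # per-request fallback, no retry loop
```

The important property is that a tripped breaker stops calling the advanced pipeline entirely, which is what breaks the regeneration loop instead of amplifying it.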
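The monitoring signal in point 2 is cheap to implement: compare document counts across the dense and sparse indexes instead of watching error rates. A sketch, with an assumed 1% tolerance that you would tune per corpus:

```python
def index_drift_alert(dense_count, sparse_count, tolerance=0.01):
    """Return True when dense/sparse document counts diverge beyond
    `tolerance` (relative to the larger index). Past that point, RRF is
    fusing over incomplete data and retrieval quality is already degraded.
    The 1% default is an illustrative assumption."""
    larger = max(dense_count, sparse_count)
    if larger == 0:
        return False  # both empty: nothing to desynchronize
    delta = abs(dense_count - sparse_count) / larger
    return delta > tolerance
```

This catches the silent case: both indexes answer queries without errors, so only the size delta reveals the desync.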
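And for point 3, the economics hinge on what counts as a cache hit. A toy semantic cache, assuming query embeddings are supplied by some upstream `embed()` step and using an illustrative 0.92 cosine threshold (brute-force scan; production would use an ANN index):

```python
import math

class SemanticCache:
    """Return a stored answer when a query embedding is within
    `threshold` cosine similarity of a cached entry; otherwise miss.
    The threshold directly trades miss rate against answer-reuse risk."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding):
        best_answer, best_sim = None, 0.0
        for emb, answer in self.entries:
            sim = self._cosine(embedding, emb)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer if best_sim >= self.threshold else None  # None = miss

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

Every `None` here is a full Advanced RAG invocation, which is why the miss rate, not the model bill per call, dominates cost at scale.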
The sovereign/air-gapped deployment note is crucial - every component competes for the same GPUs inside the perimeter. Resource contention is not a cloud problem you can horizontally scale away.
Question: In air-gapped deployments, how do you handle model weight updates without breaking the isolation? Pre-loaded weights work until the model itself needs updating.