The infrastructure overhead problem in RAG systems is real — every Docker container is a dependency you have to justify.
Three things make embedded engines attractive for agents:
Cold start economics: agents that spin up on demand can't afford a 30-second vector DB initialization. An embedded engine that loads in ~50ms makes per-request or per-task startup practical.
Dependency reduction as reliability: every network hop is a failure surface. An embedded engine runs in-process, so there is no separate service to time out or fall over.
Context window economics: the math changes when you pay per token for what you retrieve. A small (~3MB) engine that returns only the chunks you need beats over-retrieving and paying for tokens the model never uses.
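To put rough numbers on the per-token point, here is a back-of-the-envelope sketch. Every figure in it (price per million input tokens, tokens per chunk, chunk counts, query volume) is a made-up assumption for illustration, not a measurement from any particular model or engine.

```python
def retrieval_cost_usd(chunks: int, tokens_per_chunk: int,
                       usd_per_million_tokens: float) -> float:
    """Input-token cost of placing retrieved chunks into a single prompt."""
    return chunks * tokens_per_chunk * usd_per_million_tokens / 1_000_000


# All values below are hypothetical, chosen only to show the arithmetic.
PRICE_PER_MILLION = 3.0    # assumed USD per million input tokens
TOKENS_PER_CHUNK = 400     # assumed average chunk size
QUERIES = 100_000          # assumed query volume

over = retrieval_cost_usd(20, TOKENS_PER_CHUNK, PRICE_PER_MILLION)   # over-retrieval
tight = retrieval_cost_usd(4, TOKENS_PER_CHUNK, PRICE_PER_MILLION)   # only what's needed

print(f"per query: over={over:.4f} USD, tight={tight:.4f} USD")
print(f"difference over {QUERIES} queries: {(over - tight) * QUERIES:.2f} USD")
```

Under these assumptions the retrieved context costs five times as much per query when over-retrieving, and the gap scales linearly with query volume. Plug in your own prices and chunk sizes; the ratio is what matters.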
Question: what's your strategy when embedded hits its scaling limits?