We were spending 3.4x our projected API budget.
Not because our queries were complex.
Because travelers ask the same question 50 different ways.
"Best hotels near the beach in Da Nang" "Beachfront accommodation Da Nang" "Da Nang hotel recommendations near ocean"
Same intent. Same answer. Three separate API calls.
Traditional caching caught 12% of duplicates. The other 88% burned tokens on repeat work.
So we built a dual-layer cache: exact hash + semantic similarity.
Layer 1: hash the prompt, check Redis. Under 2ms. Layer 2: if the hash misses, fall back to semantic similarity via vector search.
Result after 4 weeks in production:
73% cost reduction. 340ms response on cache hits (down from 2.8s). 68% combined hit rate.
The hardest part was not building it.
It was knowing when to throw the cache away.
Travel data goes stale fast. A hotel sells out. A price changes. Your cached "best option" is now wrong.
TTL alone is not enough. We added event-driven invalidation tied to real inventory changes.
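One way to wire that up, sketched with in-memory maps: tag each cached answer with the inventory entities it depends on, then drop every tagged entry when an event for that entity arrives. The names here (storeTagged, onInventoryEvent, the hotel:X IDs) are hypothetical, not from our implementation:

```javascript
const answerCache = new Map();  // cacheKey -> cached answer
const entityIndex = new Map();  // entityId -> Set of cacheKeys depending on it

function storeTagged(cacheKey, answer, entityIds) {
  answerCache.set(cacheKey, answer);
  for (const id of entityIds) {
    if (!entityIndex.has(id)) entityIndex.set(id, new Set());
    entityIndex.get(id).add(cacheKey);
  }
}

// Called on a real inventory change: sell-out, price update, etc.
// In production this would subscribe to a queue or webhook, and would
// also delete the matching vectors from the semantic layer.
function onInventoryEvent(entityId) {
  const keys = entityIndex.get(entityId) || new Set();
  for (const k of keys) answerCache.delete(k);
  entityIndex.delete(entityId);
  return keys.size; // number of cache entries invalidated
}
```

TTL then becomes the backstop for drift the events miss, not the primary freshness mechanism.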
Wrote the full implementation (Node.js + Redis + Qdrant) here.
If you are running LLM calls in production, what is your caching strategy?
Archit Mittal
Exactly 💯
It’s rarely about model cost—it’s about duplicate prompts, repeated workflows, and unnecessary API calls.
👉 Optimize flows, reuse outputs, and reduce redundancy first—then worry about cost.
This is exactly right. I've audited LLM costs for several automation clients and the pattern is always the same: they're not overpaying for tokens, they're making the same calls 50x because nobody implemented proper caching or deduplication. My go-to stack for this: semantic caching with embeddings (so similar-but-not-identical prompts hit the cache), plus a Redis layer for exact matches. One client went from $800/month to under $200 just by caching classification results that were being re-computed on every page load. The Node.js + Redis + Qdrant combo mentioned here is solid — event-driven invalidation is the key piece most teams miss.