We were spending 3.4x our projected API budget.
Not because our queries were complex.
Because travelers ask the same question 50 different ways.
"Best hotels near the beach in Da Nang" "Beachfront accommodation Da Nang" "Da Nang hotel recommendations near ocean"
Same intent. Same answer. Three separate API calls.
Traditional caching caught 12% of duplicates. The other 88% burned tokens on repeat work.
So we built a dual-layer cache: exact hash + semantic similarity.
Layer 1: hash the prompt, check Redis. Under 2ms.
Layer 2: if the hash misses, compare meaning with a vector search.
Result after 4 weeks in production:
73% cost reduction. 340ms response on cache hits (down from 2.8s). 68% combined hit rate.
The hardest part was not building it.
It was knowing when to throw the cache away.
Travel data goes stale fast. A hotel sells out. A price changes. Your cached "best option" is now wrong.
TTL alone is not enough. We added event-driven invalidation tied to real inventory changes.
Wrote the full implementation (Node.js + Redis + Qdrant) here.
If you are running LLM calls in production, what is your caching strategy?