We were spending 3.4x our projected API budget.
Not because our queries were complex.
Because travelers ask the same question 50 different ways.
"Best hotels near the beach in Da Nang" "Beachfront accommodation Da Nang" "Da Nang hotel recommendations near ocean"
Same intent. Same answer. Three separate API calls.
Traditional caching caught 12% of duplicates. The other 88% burned tokens on repeat work.
So we built a dual-layer cache: exact hash + semantic similarity.
Layer 1: hash the prompt, check Redis. Under 2ms. Layer 2: if the hash misses, fall back to semantic similarity via vector search.
Result after 4 weeks in production:
73% cost reduction. 340ms response on cache hits (down from 2.8s). 68% combined hit rate.
The hardest part was not building it.
It was knowing when to throw the cache away.
Travel data goes stale fast. A hotel sells out. A price changes. Your cached "best option" is now wrong.
TTL alone is not enough. We added event-driven invalidation tied to real inventory changes.
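One way to wire that up, sketched with in-memory maps: tag each cached answer with the inventory entities it depends on, then drop every tagged entry when an event for that entity arrives. The names here (storeTagged, onInventoryEvent, the hotel:X IDs) are hypothetical, not from our implementation:

```javascript
const answerCache = new Map();  // cacheKey -> cached answer
const entityIndex = new Map();  // entityId -> Set of cacheKeys depending on it

function storeTagged(cacheKey, answer, entityIds) {
  answerCache.set(cacheKey, answer);
  for (const id of entityIds) {
    if (!entityIndex.has(id)) entityIndex.set(id, new Set());
    entityIndex.get(id).add(cacheKey);
  }
}

// Called on a real inventory change: sell-out, price update, etc.
// In production this would subscribe to a queue or webhook, and would
// also delete the matching vectors from the semantic layer.
function onInventoryEvent(entityId) {
  const keys = entityIndex.get(entityId) || new Set();
  for (const k of keys) answerCache.delete(k);
  entityIndex.delete(entityId);
  return keys.size; // number of cache entries invalidated
}
```

TTL then becomes the backstop for drift the events miss, not the primary freshness mechanism.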
Wrote the full implementation (Node.js + Redis + Qdrant) here.
If you are running LLM calls in production, what is your caching strategy?
Archit Mittal
Exactly 💯
It’s rarely about model cost—it’s about duplicate prompts, repeated workflows, and unnecessary API calls.
👉 Optimize flows, reuse outputs, and reduce redundancy first—then worry about cost.
This is exactly right. I've audited LLM costs for several automation clients and the pattern is always the same: they're not overpaying for tokens, they're making the same calls 50x because nobody implemented proper caching or deduplication. My go-to stack for this: semantic caching with embeddings (so similar-but-not-identical prompts hit the cache), plus a Redis layer for exact matches. One client went from $800/month to under $200 just by caching classification results that were being re-computed on every page load. The Node.js + Redis + Qdrant combo mentioned here is solid — event-driven invalidation is the key piece most teams miss.