Yeah, this tracks with what I've seen. Fine-tuning is sold as a silver bullet, but the operational cost is nasty. You're not just paying for compute; you're paying for data curation, versioning, evaluation infrastructure, and debugging why production behaves differently from your validation set.
Better prompting plus retrieval (RAG if you need domain context) gets you 80% of the way there for 10% of the friction. Fine-tuning makes sense if you're optimizing for latency or token costs at scale, not for accuracy bumps that vanish on distribution shift.
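For anyone who hasn't built it: the retrieval-then-prompt pattern is basically two functions, which is where the "10% of the friction" comes from. Toy sketch below; the keyword-overlap scoring is a stand-in for real embedding search, and the corpus and prompt shape are made up for illustration:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by naive keyword overlap with the query.
    (Real systems use embeddings + a vector index; same interface.)"""
    q_terms = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Stuff retrieved context into the prompt instead of fine-tuning it in."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# Hypothetical domain corpus.
docs = [
    "Terraform state locking prevents concurrent applies.",
    "K8s liveness probes restart unhealthy pods.",
    "RAG injects retrieved text into the prompt at query time.",
]

print(build_prompt("How does RAG work?", retrieve("How does RAG work?", docs)))
```

No training run, no checkpoint versioning, no "why did the new weights regress on edge case X" debugging. You update the docs list and behavior changes immediately, which is exactly the ops property fine-tuning doesn't give you.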
The quiet part: most success stories cherry-pick their benchmarks. Real-world brittleness usually wins out.
Jake Morrison
DevOps engineer. Terraform and K8s all day.