Just spent two weeks trying to fine-tune a model for our app's domain-specific tasks. Downloaded a bunch of guides, set up LoRA, tuned the learning rate, waited 8 hours for training on an A100.
Result: marginally better than improved prompting alone. Maybe a 2-3% accuracy gain, and even that disappeared when I fed it slightly different input formats.
The real problem is that fine-tuning advocates rarely mention the data requirements. You need hundreds of high-quality examples, all labeled consistently, and most teams don't have that. You end up doing manual labeling for weeks, which often costs more than just paying for a better model.
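A concrete version of the "labeled consistently" problem: the same input showing up with different labels from different annotators. A quick sanity check like the sketch below (function name and example records are made up for illustration) will catch this before you burn GPU hours on it.

```python
from collections import defaultdict

def check_label_consistency(examples):
    """Group records by input text and report inputs that were
    labeled differently, e.g. by different annotators."""
    labels_by_input = defaultdict(set)
    for text, label in examples:
        labels_by_input[text.strip().lower()].add(label)
    return {text: labels for text, labels in labels_by_input.items()
            if len(labels) > 1}

data = [
    ("refund my order", "billing"),
    ("Refund my order", "support"),   # same input, conflicting label
    ("reset password", "account"),
]
conflicts = check_label_consistency(data)
# conflicts == {"refund my order": {"billing", "support"}}
```

If this dict comes back non-empty on a few hundred examples, the 2-3% gain you're chasing is probably inside the label noise.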
And everyone's promoting LoRA like it's free. It's not free. If you serve the adapter unmerged, you're adding inference latency and model complexity; if you merge it back into the base weights, latency matches the base model, but now you're maintaining and versioning a separate full checkpoint per task.
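A toy illustration (plain Python, tiny made-up matrices) of why the merged path costs the same at inference as the base model: the LoRA update W' = W + B·A is just another dense matrix of the same shape, while the unmerged path pays for two extra matmuls every forward pass.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def madd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, 2x2
B = [[0.5], [0.0]]             # LoRA factor, 2x1 (rank 1)
A = [[0.0, 2.0]]               # LoRA factor, 1x2
x = [[3.0, 4.0]]               # input row vector

# Unmerged serving: extra matmuls through B and A on every call.
y_unmerged = madd(matmul(x, W), matmul(matmul(x, B), A))

# Merged serving: fold B @ A into W once, then a single matmul.
W_merged = madd(W, matmul(B, A))
y_merged = matmul(x, W_merged)

# y_unmerged == y_merged == [[3.0, 7.0]]
```

Same output either way; the cost you can't merge away is the operational one of keeping per-task checkpoints around.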
The honest take: unless you've got specific domain data that's genuinely different from training data (medical records, proprietary protocols, that kind of thing), just spend the $20/month on Claude Pro or GPT-4 API and write better prompts. Add retrieval if you need grounding. You'll ship faster and get better results.
Fine-tuning might matter when you're doing 10M+ inferences monthly. Before that it's mostly theater.
Yeah, this tracks with what I've seen. Fine-tuning is sold as a silver bullet but the operational cost is nasty. You're not just paying for compute, you're paying for data curation, versioning, evaluation infrastructure, and debugging why production behaves differently than your validation set.
Better prompting plus retrieval (RAG if you need domain context) gets you 80% of the way there for 10% of the friction. Fine-tuning makes sense if you're optimizing for latency or token costs at scale, not for accuracy bumps that vanish on distribution shift.
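For anyone who hasn't tried the prompting-plus-retrieval route: it can be this small to start. The sketch below is hypothetical end to end (the docs, the keyword-overlap scoring, the prompt template); a real system would use embeddings and a vector store, but the shape is the same.

```python
def retrieve(query, docs, k=2):
    """Rank docs by naive keyword overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, docs):
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (f"Use only the context below to answer.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")

docs = [
    "Refunds are processed within 5 business days.",
    "Password resets require a verified email address.",
    "Enterprise plans include priority support.",
]
prompt = build_prompt("how long do refunds take", docs)
```

No training run, no labeled data, and swapping the docs updates the "domain knowledge" instantly, which is exactly where fine-tuning is clumsy.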
The quiet part: most success stories cherry-pick their benchmarks. Real-world brittleness usually wins out.
Tom Lindgren
Senior dev. PostgreSQL and data engineering.
This tracks with what I've seen in data work. The unsexy truth: your baseline problem isn't the model, it's your data pipeline. Most teams trying to fine-tune are working with inconsistent labels, distribution drift between train and inference, and no real validation strategy.
Better prompting plus aggressive data cleaning usually gets you further than fine-tuning. If you're at 2-3% gains from tuning, that's noise. Spend those two weeks on proper feature engineering or fixing your input normalization instead.
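To make the normalization point concrete: a lot of "distribution drift" is just trivial formatting variance the pipeline never canonicalized. A sketch under those assumptions (the function name and rule set are illustrative, not a standard):

```python
import re
import unicodedata

def normalize(text):
    """Collapse trivial formatting variance before inference."""
    text = unicodedata.normalize("NFKC", text)   # unify unicode forms
    text = text.replace("\u00a0", " ")           # non-breaking spaces
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text.lower()

variants = ["Reset  Password", "reset password ", "Reset\u00a0password"]
# all three collapse to the same canonical string
assert len({normalize(v) for v in variants}) == 1
```

If three surface forms of the same request hit the model as three different strings, no amount of fine-tuning fixes that; a ten-line canonicalizer does.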