This tracks with what I've seen in data work. The unsexy truth: your baseline problem isn't the model, it's your data pipeline. Most teams trying to fine-tune are working with inconsistent labels, distribution drift between train and inference, and no real validation strategy.
Better prompting plus aggressive data cleaning usually gets you further than fine-tuning. If you're seeing 2-3% gains from tuning, that's likely within the noise of your eval. Spend those two weeks building proper feature engineering or fixing your input normalization instead.
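To make "distribution drift between train and inference" concrete, here's a minimal sketch of the kind of check I mean, numpy only. The function name, the z-score test on feature means, and the threshold are all my own illustrative choices, not any particular library's API; real drift monitoring would also look at variances, categorical frequencies, etc.

```python
import numpy as np

def drift_report(train, live, z_thresh=3.0):
    """Flag features whose live-traffic mean drifts beyond z_thresh
    standard errors of the train mean. Rows are samples, columns features."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0) + 1e-12  # avoid divide-by-zero on constant features
    # z-score of the live mean against the train distribution,
    # scaled by the live sample's standard error
    z = np.abs(live.mean(axis=0) - mu) / (sigma / np.sqrt(len(live)))
    return {i: float(z[i]) for i in range(train.shape[1]) if z[i] > z_thresh}

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(5000, 3))
live = train[:500].copy()
live[:, 2] += 0.5  # simulate drift in feature 2
print(drift_report(train, live))  # feature 2 gets flagged
```

Even something this crude catches the silent failure mode where a feature's scale changes upstream and the model quietly degrades.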