Archit Mittal
I Automate Chaos — AI workflows, n8n, Claude, and open-source automation for businesses. Turning repetitive work into one-click systems.
Conformal prediction applied to LLMs is way under-discussed — most teams are still just tracking "accuracy" on a golden set and crossing their fingers. The single-number coverage guarantee is what makes it actually sellable to compliance teams. One challenge I'd love your take on: calibration drift when you swap an underlying model version. Do you recalibrate on a rolling window, or treat each model ID as its own independent calibration set? We've gone back and forth on that for a client in healthcare-adjacent workflows.

I’d treat model ID as the calibration boundary. The reasoning: conformal coverage depends on calibration and deployment examples being exchangeable under the same scoring setup. Once the underlying model changes, the score distribution can change too, so I would not let a new model silently inherit the old calibration guarantee. Rolling windows are useful for drift monitoring and threshold refreshes, but for compliance I’d keep calibration sets versioned by model + prompt + task + data distribution.
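To make that concrete, here is a minimal sketch of what the versioned boundary could look like, assuming split-conformal nonconformity scores; the registry, key structure, and names are illustrative, not any particular library's API:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample split-conformal quantile: with n calibration scores,
    the ceil((n + 1) * (1 - alpha)) / n empirical quantile gives
    >= 1 - alpha coverage on exchangeable deployment examples."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

# Calibration thresholds keyed by the full boundary, so a new model
# version can never silently inherit an old threshold. (Hypothetical
# in-memory registry; in production this would be a versioned store.)
registry = {}

def register_calibration(model_id, prompt_version, task, cal_scores, alpha=0.1):
    registry[(model_id, prompt_version, task)] = conformal_threshold(cal_scores, alpha)

def threshold_for(model_id, prompt_version, task):
    key = (model_id, prompt_version, task)
    if key not in registry:
        raise KeyError(f"No calibration set for {key}: recalibrate before serving.")
    return registry[key]

# Usage: a model swap changes the key, so lookups fail loudly until recalibrated.
register_calibration("model-2024-01", "v3", "triage", np.random.rand(500))
threshold_for("model-2024-01", "v3", "triage")    # OK
# threshold_for("model-2024-06", "v3", "triage")  # raises KeyError
```

The design point is that the lookup key, not a rolling window, carries the guarantee: rolling-window scores can still feed drift monitoring, but they never overwrite a versioned threshold.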