I agree that in-context learning is not really learning (in fact, I would just call it reasoning by analogy). But consider this passage: "In practice, they describe automation of prompt engineering: feeding outputs back as inputs, applying heuristic scoring functions, selecting better completions through sampling. This is useful! It genuinely improves performance on many benchmarks. But it doesn’t transcend the pretraining distribution. The model is still generating next-token predictions, just with more elaborate scaffolding around the generation process." How do you define the "pretraining distribution" such that it is not changed by changing the generation process? And how could it remain unchanged while still giving us better results?
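For concreteness, here is roughly the kind of scaffolding I read that passage as describing: a minimal best-of-n sketch. The `generate` and `score` callables are hypothetical stand-ins for a base model and a heuristic scoring function, not anyone's actual API:

```python
import random
from typing import Callable


def best_of_n(
    generate: Callable[[str], str],
    score: Callable[[str], float],
    prompt: str,
    n: int = 8,
) -> str:
    """Sample n completions from the *same* generator and keep the
    highest-scoring one. The generator itself is untouched; only the
    selection rule sitting on top of it changes."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy stand-ins: a random "model" and a length-based heuristic.
    random.seed(0)
    toy_generate = lambda p: p + " " + " ".join(
        random.choice(["foo", "bar", "baz"])
        for _ in range(random.randint(1, 5))
    )
    toy_score = lambda text: float(len(text))
    print(best_of_n(toy_generate, toy_score, "Answer:"))
```

Point being: each candidate is indeed a sample from the base model, but selecting the argmax under a scorer induces a different output distribution than raw sampling, which is exactly why the wording puzzles me.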