Hansjörg Wyss, that "which layer to blame" problem is exactly where I kept getting stuck - and your cross-model comparison trick is a clean way to isolate it. Going to find a way to build that into the debugging workflow here - probably as a standard step when a sprint turns up a behavior I can't explain. To answer your question: raw API calls. First sprint is literally while True, tools as plain functions, state as a dict. No framework until I have something hand-rolled to compare against - otherwise I won't know what the framework is actually buying me. The "break it on purpose" instinct is baked into every sprint here - each one has an explicit failures section and tests designed to find breakage, not just pass. But I'm curious what your experience was when you finally did add a framework back - did it actually solve the things that broke, or just hide them?
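
For concreteness, here is a rough sketch of what that first-sprint shape might look like. It is only an illustration: the raw API call is replaced by a stub (fake_model), and every name in it (add, TOOLS, fake_model, run, the state keys, the reply format) is made up for the example rather than taken from any real SDK.

```python
from typing import Callable

# Tools as plain functions. "add" is a hypothetical example tool.
def add(a: float, b: float) -> float:
    return a + b

TOOLS: dict[str, Callable[..., float]] = {"add": add}

def fake_model(state: dict) -> dict:
    """Stand-in for the raw API call (assumption: the real one would go here).

    Requests one tool call, then returns a final answer once a result exists.
    """
    if not state["tool_results"]:
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "text": f"The answer is {state['tool_results'][-1]}"}

def run(task: str, model: Callable[[dict], dict] = fake_model, max_steps: int = 10) -> dict:
    # State as a dict: everything the loop knows lives here.
    state = {"task": task, "tool_results": [], "steps": 0, "answer": None}
    while True:
        # Explicit step limit so a model that never finishes cannot loop forever.
        if state["steps"] >= max_steps:
            state["answer"] = "stopped: step limit reached"
            break
        state["steps"] += 1
        reply = model(state)
        if reply["type"] == "final":
            state["answer"] = reply["text"]
            break
        # Otherwise dispatch the requested tool and record its result.
        tool = TOOLS[reply["name"]]
        state["tool_results"].append(tool(**reply["args"]))
    return state
```

Passing in a model stub that never returns a final answer is one cheap "break it on purpose" test: the loop should stop at max_steps instead of spinning.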