Hansjörg Wyss , that "which layer to blame" problem is exactly where I kept getting stuck - and your cross-model comparison trick is a clean way to isolate it. Going to find a way to build that into the debugging workflow here - probably as a standard step when a sprint turns up a behavior I can't explain.
To answer your question: raw API calls. First sprint is literally while True, tools as plain functions, state as a dict. No framework until I have something hand-rolled to compare against - otherwise I won't know what the framework is actually buying me.
The "break it on purpose" instinct is baked into every sprint here - each one has an explicit failures section and tests designed to find breakage, not just pass. But I'm curious what your experience was when you finally did add a framework back - did it actually solve the things that broke, or just hide them?
The "it works vs. I understand why it works" gap is the one that actually separates people who can debug agents from people who can only demo them. My honest take: the understanding doesn't come from reading more — it comes from deliberately breaking things. Strip the framework, write the loop by hand once, watch where it falls apart.
One habit that accelerated this for me: when I hit a behavior I couldn't explain, I'd run the same prompt across a few different models side by side (I use MultipleChat for this) and compare how each one reasoned through it. Seeing where they diverged usually exposed which part was the model and which part was my scaffolding — that contrast taught me more than any single explanation did.
Looking forward to following the ground-up series. Are you rebuilding with a framework stripped out, or starting from raw API calls?