The separation between workflow_planning and execution instances is the architectural insight most multi-agent systems miss.
When planning stays implicit in a single agent, the system has no methodological layer to improve. The workflow_planning instance makes "how we work" explicit and refactorable, the same pattern I see in production agent systems that survive beyond demos.
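To make that concrete: here is a minimal sketch of what an explicit, refactorable plan might look like as a data structure. The names (`Step`, `WorkflowPlan`, `execution_order`) are my own illustrations, not the framework's API; the point is only that once "how we work" is reified, it can be inspected and rewritten like any other artifact.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    # Hypothetical step record: the methodology lives in `instruction`,
    # visible and editable, rather than implicit in one agent's prompt.
    name: str
    instruction: str
    depends_on: list = field(default_factory=list)

@dataclass
class WorkflowPlan:
    steps: list = field(default_factory=list)

    def add(self, step: Step) -> None:
        self.steps.append(step)

    def execution_order(self) -> list:
        """Topologically sort steps so dependencies run first."""
        order, seen = [], set()

        def visit(s: Step) -> None:
            if s.name in seen:
                return
            seen.add(s.name)
            for dep in s.depends_on:
                visit(next(t for t in self.steps if t.name == dep))
            order.append(s.name)

        for s in self.steps:
            visit(s)
        return order

plan = WorkflowPlan()
plan.add(Step("draft", "produce first-pass artifact"))
plan.add(Step("review", "check draft against spec", depends_on=["draft"]))
plan.add(Step("finalize", "merge review feedback", depends_on=["review"]))
```

Because the plan is data, "refactoring the methodology" is just editing this structure between runs.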
The dual validation (local + global) also addresses a real gap. Most RAG and agent systems optimize for local output quality. But in long-dependency workflows, an artifact can be locally sufficient and globally inconsistent. Making that tension an architectural concern, rather than leaving it to ex-post verification, matters for reliability.
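A toy sketch of the failure mode, with illustrative validator names of my own (not the framework's): each artifact passes its local check, yet the set as a whole is inconsistent because a cross-reference points at nothing.

```python
def validate_local(artifact: dict) -> bool:
    # Local sufficiency: the artifact is complete on its own terms.
    return bool(artifact.get("content")) and artifact.get("status") == "done"

def validate_global(artifacts: list) -> bool:
    # Global consistency: every cross-reference resolves to a real artifact.
    ids = {a["id"] for a in artifacts}
    return all(ref in ids for a in artifacts for ref in a.get("refs", []))

artifacts = [
    {"id": "spec", "content": "...", "status": "done", "refs": []},
    {"id": "impl", "content": "...", "status": "done", "refs": ["spec", "tests"]},
]

# Every artifact is locally fine, but "impl" references "tests",
# which was never produced, so the workflow is globally inconsistent.
locally_ok = all(validate_local(a) for a in artifacts)   # True
globally_ok = validate_global(artifacts)                 # False
```

A system that only runs `validate_local` signs off on exactly this broken state, which is why the global pass has to be architectural rather than an afterthought.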
The improvement_records layer is essentially a formalized learning mechanism. Without it, corrections live in ephemeral conversation memory or manual intervention. With it, the architecture can accumulate lessons that condition future planning.
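One way to picture that layer, again as a hypothetical sketch (the record fields and `render_for_planner` are my assumptions, not the framework's schema): corrections are appended as structured records instead of evaporating with the conversation, then surfaced as context for the next planning pass.

```python
from datetime import date

records = []  # durable store standing in for the improvement_records layer

def record_lesson(step: str, failure: str, correction: str) -> None:
    # Persist a correction as structured data rather than ephemeral chat memory.
    records.append({
        "date": date.today().isoformat(),
        "step": step,
        "failure": failure,
        "correction": correction,
    })

def render_for_planner() -> str:
    """Turn accumulated lessons into context that conditions future planning."""
    return "\n".join(
        f"- When running '{r['step']}': {r['failure']} -> {r['correction']}"
        for r in records
    )

record_lesson(
    "review",
    "missed cross-file inconsistencies",
    "run global validation before sign-off",
)
```

The planner reads `render_for_planner()` on the next run, so a lesson learned once conditions every workflow after it.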
One question: the direct-to-model prompting hypothesis runs counter to the common practice of role-persona prompts. Have you observed practical differences in focus or token efficiency when addressing the model directly versus addressing an assigned character? The literature on persona prompts is skeptical, but empirical testing would clarify whether this is a meaningful architectural decision or merely a stylistic preference.
Strong framework. The value is in making workflow organization an explicit design problem rather than an emergent property of agent counts and prompts.