In this article
- Where autonomy breaks: silent repair and hidden drift
- A concrete failure: how the pipeline closed a task on diverging reports
- Solution 1: JSON contract between phases
- Solution 2: independent VERIFY outside the agent perimeter
- Solution 3: router and five models for five phases
- Principles and Closing
Where autonomy breaks
Technical debt is the kind of work nobody has time for. It looks like the perfect target for automation: split the roles (PM, system analyst, developer, QA) and run an autonomous loop over the ticket queue.
On paper, it works. In practice, autonomy turns into silent repair.
The agent hits a problem in its environment — and does not report a block. Instead it starts editing neighboring code, poking config files, “optimizing” things it was not asked to touch. The output is “Done”, but half the environment has drifted away from the spec.
I call this hidden drift. Each individual edit looks like a local improvement. Stacked together, they become a silent repair nobody requested — and one that is hard to roll back, because the Git diff stopped representing a single task long ago.
A concrete failure
Today the pipeline closed a task as DONE. The developer agent did the work, ran check_command, got exit=0, reported passed=true.
An independent parallel check returned exit=1.
Nobody lied. check_command contained a test pattern that behaves differently on linux and macOS. The build artifact turned out to be incompatible with the verification environment. The agent saw “success” from its own vantage point. From another vantage point — failure.
This one case shapes the entire architecture downstream. If we trust the agent’s self-report, we trust a slice of the environment it happens to be in. And the environment is a variable.
Solution 1: JSON contract between phases
In early iterations, phases talked to each other in free-form text. The analyst described a plan, the developer read it and interpreted. Sounds like a normal human process. In practice, it produces three problems:
- the model invents fields nobody asked for
- the next phase has to parse free-form text
- “I thought about it and decided to do a bit more” is a real phrase in our logs
We moved to a JSON schema between every pair of phases. Constrained generation via tool_use forces the model to fill exactly the fields the next phase expects, and nothing extra.
Each transition is a structural artifact bound to the schema. No “free creativity”. Want to add a thought — there is no field for it in the schema, so the thought does not propagate. Free-form text stays in one place only — human-facing comments, which never feed the next step.
The point: the schema is not the “nicest format” for humans, it is the format the next phase needs. If reviewers want to see something else, render a human-readable view from the same JSON. Do not blend the two channels.
Solution 2: independent VERIFY
From the check_command story I extracted a hard rule: verification must live outside the agent.
On the left — the agent’s environment. It runs there: it knows its dependencies, local configs, caches, env vars. It can do whatever it wants. The point is, it reports a result.
On the right — the verify environment. Fresh clone, clean container, no artifacts from the agent, no pre-built binaries. It runs the same check_command from an independent point. If the two results disagree — the task moves to BLOCKED.
This is not more expensive than trusting. One extra verification run costs far less than a silent close with hidden drift. The audit of those closes will happen anyway — just a month later, with the context already lost.
Solution 3: model specialization via router
One universal agent for everything is an anti-pattern. Every phase has its own load profile, and a universal model either overpays or underdelivers.
Five roles for five load types:
- planner — structural decisions, sequencing and precision over JSON
- reasoning specialist — heavy justification, architectural reasoning
- long-context — summarizing documents and long tickets
- executor — short patches, routine operations
- fast chat — conversational replies, simple answers
The router looks at the phase type and artifact complexity, and routes the call to the matching model. Locally, with no lock-in to a single provider. This is about cost, and also about the fact that different models are good at different things — there is no point producing a structural plan with a model tuned for long-form dialogue.
Principles
- Contract discipline beats generation speed. Free-form text between phases is an invitation to drift.
- Do not trust self-report. VERIFY lives outside the agent perimeter.
- Stop explicitly. BLOCKED is a status, not a fallback. If something does not add up — the task waits for a human. It does not close as “Done”.
- Specialization beats universality. A pipeline of specialized models is cheaper and more stable than one large model doing everything.
Closing
Agent autonomy is not freedom. It is a contract. The tighter the contract and the more independent the verification, the less drift and the fewer silent closes.
Knowing when to stop is a skill. A pipeline that can say “I don’t know” is safer than a pipeline that always says “done”.
Architectural question: where in your process is the agent’s self-report currently treated as truth, and what is the smallest step that would put an independent check next to it?