The Discipline of Agent Pipelines: Teaching Agents to Stop

Where autonomy breaks

Technical debt is the kind of work nobody has time for. It looks like the perfect target for automation: split the roles (PM, system analyst, developer, QA) and run an autonomous loop over the ticket queue.

On paper, it works. In practice, autonomy turns into silent repair.

The agent hits a problem in its environment — and does not report a block. Instead it starts editing neighboring code, poking config files, “optimizing” things it was not asked to touch. The output is “Done”, but half the environment has drifted away from the spec.

I call this hidden drift. Each individual edit looks like a local improvement. Stacked together, they become a silent repair nobody requested — and one that is hard to roll back, because the Git diff stopped representing a single task long ago.

A concrete failure

Today the pipeline closed a task as DONE. The developer agent did the work, ran check_command, got exit=0, reported passed=true.

An independent parallel check returned exit=1.

Nobody lied. check_command contained a test pattern that behaves differently on linux and macOS. The build artifact turned out to be incompatible with the verification environment. The agent saw “success” from its own vantage point. From another vantage point — failure.

This one case shapes the entire architecture downstream. If we trust the agent’s self-report, we trust a slice of the environment it happens to be in. And the environment is a variable.

Solution 1: JSON contract between phases

In early iterations, phases talked to each other in free-form text. The analyst described a plan, the developer read it and interpreted. Sounds like a normal human process. In practice, it produces three problems:

the model invents fields nobody asked for
the next phase has to parse free-form text
“I thought about it and decided to do a bit more” is a real phrase in our logs

We moved to a JSON schema between every pair of phases. Constrained generation via tool_use forces the model to fill exactly the fields the next phase expects, and nothing extra.

stateDiagram-v2 [*] --> PLAN PLAN --> SPEC: JSON plan SPEC --> DEV: JSON spec DEV --> QA: JSON diff + report QA --> VERIFY: JSON review VERIFY --> DONE: independent pass VERIFY --> BLOCKED: mismatch PLAN --> BLOCKED: cannot plan SPEC --> BLOCKED: ambiguity DEV --> BLOCKED: env issue QA --> BLOCKED: review fail BLOCKED --> [*]: handover to human DONE --> [*]

Each transition is a structural artifact bound to the schema. No “free creativity”. Want to add a thought — there is no field for it in the schema, so the thought does not propagate. Free-form text stays in one place only — human-facing comments, which never feed the next step.

The point: the schema is not the “nicest format” for humans, it is the format the next phase needs. If reviewers want to see something else, render a human-readable view from the same JSON. Do not blend the two channels.

Solution 2: independent VERIFY

From the check_command story I extracted a hard rule: verification must live outside the agent.

graph TB subgraph agentenv ["Agent environment"] A1[Developer agent] A2[Local config] A3[Cache, env vars] A4[Build artifact] A1 --- A2 A1 --- A3 A1 --- A4 end subgraph verifyenv ["Verify environment"] V1[Fresh clone] V2[Clean container] V3[Independent check_command] V1 --- V2 V2 --- V3 end A1 -->|reports passed=true| C{Compare} V3 -->|independent exit code| C C -->|match| OK[DONE] C -->|mismatch| BL[BLOCKED]

On the left — the agent’s environment. It runs there: it knows its dependencies, local configs, caches, env vars. It can do whatever it wants. The point is, it reports a result.

On the right — the verify environment. Fresh clone, clean container, no artifacts from the agent, no pre-built binaries. It runs the same check_command from an independent point. If the two results disagree — the task moves to BLOCKED.

This is not more expensive than trusting. One extra verification run costs far less than a silent close with hidden drift. The audit of those closes will happen anyway — just a month later, with the context already lost.

Solution 3: model specialization via router

One universal agent for everything is an anti-pattern. Every phase has its own load profile, and a universal model either overpays or underdelivers.

graph LR IN[Phase request] --> R{Local router} R -->|structural JSON| M1[Planner] R -->|deep reasoning| M2[Reasoning specialist] R -->|long document| M3[Long-context] R -->|short patch| M4[Executor] R -->|simple reply| M5[Fast chat] M1 --> OUT[Structural artifact] M2 --> OUT M3 --> OUT M4 --> OUT M5 --> OUT

Five roles for five load types:

planner — structural decisions, sequencing and precision over JSON
reasoning specialist — heavy justification, architectural reasoning
long-context — summarizing documents and long tickets
executor — short patches, routine operations
fast chat — conversational replies, simple answers

The router looks at the phase type and artifact complexity, and routes the call to the matching model. Locally, with no lock-in to a single provider. This is about cost, and also about the fact that different models are good at different things — there is no point producing a structural plan with a model tuned for long-form dialogue.

Principles

Contract discipline beats generation speed. Free-form text between phases is an invitation to drift.
Do not trust self-report. VERIFY lives outside the agent perimeter.
Stop explicitly. BLOCKED is a status, not a fallback. If something does not add up — the task waits for a human. It does not close as “Done”.
Specialization beats universality. A pipeline of specialized models is cheaper and more stable than one large model doing everything.

Closing

Agent autonomy is not freedom. It is a contract. The tighter the contract and the more independent the verification, the less drift and the fewer silent closes.

Knowing when to stop is a skill. A pipeline that can say “I don’t know” is safer than a pipeline that always says “done”.

Architectural question: where in your process is the agent’s self-report currently treated as truth, and what is the smallest step that would put an independent check next to it?