Harness Design: Planner, Generator, Evaluator for Production LLM Agents
A single ReAct agent is enough to demonstrate a pattern. A system that runs for hours, implements features across a codebase, and produces artifacts that humans will inspect and deploy is a different scale of problem. At that scale, a flat agent breaks down: context fills, direction drifts, and the observable behavior stops matching the stated goal.
Harness design is the discipline of building the components around a language model that turn a research prototype into a production system. Anthropic's recommended harness for long-running applications is a three-agent architecture inspired by generative adversarial networks: a Planner that expands a short prompt into a specification, a Generator that implements features iteratively, and an Evaluator that interacts with the running system as a user would and returns feedback (Anthropic, 2024). The pattern is specific, opinionated, and widely adopted. The rest of this article describes the three components, the "sprint contract" that keeps them aligned, and the design principle Anthropic articulates about when each component can be removed.
Three roles
```mermaid
flowchart LR
    REQ([Short user prompt]) --> P[Planner]
    P --> SPEC[Specification]
    SPEC --> G[Generator]
    G --> IMPL[Implementation]
    IMPL --> E[Evaluator]
    E --> FB[Structured feedback]
    FB --> G
    E -->|passes| OUT([Done])
```
The Planner takes a short prompt (one to four sentences) and expands it into a comprehensive specification. The specification names the scope, the sub-tasks, the acceptance criteria, and the constraints. The planner runs once per major task, not once per step; its output is the contract the generator works against.
The Generator implements the specification. It works in sprints, one feature at a time, and produces artifacts: code, configuration, documentation, or whatever the task demands. The generator is the agent with the most tool access; it reads and writes files, runs tests, and produces the observable output.
The Evaluator interacts with the running system. Anthropic's reference implementation uses Playwright via MCP to drive a browser and verify behavior end to end. The evaluator is deliberately not the same model or the same role as the generator; its mandate is to find issues, not to fix them. When the evaluator passes the artifact, the sprint is done. When it fails, the feedback returns to the generator for revision.
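The shape is explicitly inspired by GANs: the generator produces, the evaluator criticizes, and the feedback between them improves the result. The analogy stops at the division of labor. In a GAN, both networks are updated by gradient descent; here no weights change, and improvement happens through feedback carried in context from sprint to sprint. What Anthropic carries over is the dynamic of a critic tuned to be harsher than the generator's self-assessment.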
The sprint contract
Before implementation begins, the generator and evaluator negotiate a sprint contract. The contract defines what success looks like for the current feature: which user-facing behaviors must work, which edge cases must be handled, and which measurable criteria (tests passing, response times, error rates) gate acceptance. The contract is written down as a file both agents can read.
The contract matters because it constrains both sides. The generator cannot define success by its own output; the evaluator cannot move the goalposts after the work is done. When the contract is negotiated up front and both agents commit to it, the iteration loop converges faster and produces artifacts that match intent rather than artifacts that match whatever the agents drifted into.
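A contract of this kind can be sketched as a small, immutable data structure saved to a file that both agents read. This is a minimal illustration, not Anthropic's format: the `SprintContract` and `Criterion` names, the JSON layout, and the metric keys are all assumptions made for the example.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class Criterion:
    """One measurable gate (hypothetical shape), e.g. p95 latency <= 300 ms."""
    name: str
    metric: str                  # key the evaluator reports
    threshold: float
    higher_is_better: bool = False


@dataclass(frozen=True)
class SprintContract:
    feature: str
    behaviors: tuple[str, ...]   # user-facing behaviors that must work
    edge_cases: tuple[str, ...]
    criteria: tuple[Criterion, ...]

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "SprintContract":
        raw = json.loads(path.read_text())
        return cls(
            feature=raw["feature"],
            behaviors=tuple(raw["behaviors"]),
            edge_cases=tuple(raw["edge_cases"]),
            criteria=tuple(Criterion(**c) for c in raw["criteria"]),
        )

    def unmet(self, measured: dict[str, float]) -> list[str]:
        """Names of criteria the measurements fail; a missing metric counts as a failure."""
        failed = []
        for c in self.criteria:
            value = measured.get(c.metric)
            if value is None:
                failed.append(c.name)
            elif c.higher_is_better and value < c.threshold:
                failed.append(c.name)
            elif not c.higher_is_better and value > c.threshold:
                failed.append(c.name)
        return failed


contract = SprintContract(
    feature="login",
    behaviors=("user can log in with valid credentials",),
    edge_cases=("wrong password shows an error",),
    criteria=(
        Criterion("all tests pass", "test_pass_rate", 1.0, higher_is_better=True),
        Criterion("p95 latency", "p95_latency_ms", 300.0),
    ),
)

# Both agents read the same file, so neither side can redefine success later.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sprint_01_contract.json"
    contract.save(path)
    reloaded = SprintContract.load(path)

failures = reloaded.unmet({"test_pass_rate": 1.0, "p95_latency_ms": 450.0})
```

Freezing the dataclasses is a deliberate choice in this sketch: once the contract is negotiated, neither agent can mutate it in memory, and acceptance reduces to checking that `unmet` returns an empty list.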
Inter-agent communication via files
Anthropic's reference implementation uses a specific communication pattern: one agent writes a file, another agent reads it and responds in that file or a new one. The pattern has two properties worth naming. First, the message log becomes a durable artifact. Every decision, every specification, every feedback cycle is on disk. Debugging is a matter of reading files, not replaying model calls. Second, the file becomes the handoff contract. An agent that writes the sprint contract file is committing to it; an agent that reads it is bound by it.
The alternative (streaming messages directly between agents) is faster but less durable. For long-running systems, durability wins.
Two versions in code
The excerpt below sketches the three-role architecture without a framework. Each role is an LLM call. The specification, which stands in for the sprint contract here, and the feedback pass between roles as structured outputs; the artifact is plain text.
```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class Spec(BaseModel):
    scope: str
    subtasks: list[str]
    acceptance_criteria: list[str]


class Feedback(BaseModel):
    verdict: Literal["pass", "revise"]  # constrained so the == "pass" check is reliable
    issues: list[str]


def planner(prompt: str) -> Spec:
    """Expand a short prompt into a structured specification (runs once per task)."""
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions=("Expand the user's short prompt into a full specification "
                      "with scope, 3-5 subtasks, and acceptance criteria."),
        input=prompt,
        text_format=Spec,
    )
    return r.output_parsed  # parsed Spec; avoids indexing into r.output by position


def generator(spec: Spec, feedback: Feedback | None = None) -> str:
    """Produce an artifact for the spec, addressing prior issues if there are any."""
    fb = ""
    if feedback is not None:
        fb = "\n\nIssues from prior attempt:\n" + "\n".join(f"- {i}" for i in feedback.issues)
    r = client.responses.create(
        model="gpt-4o-mini",
        instructions="Implement the specification. Return the artifact.",
        input=f"Spec: {spec.model_dump_json()}{fb}",
    )
    return r.output_text


def evaluator(artifact: str, spec: Spec) -> Feedback:
    """Check the artifact against the acceptance criteria and report issues."""
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions=("Act as a strict evaluator. Check the artifact against "
                      "the acceptance criteria. Find issues before passing."),
        input=f"Artifact:\n{artifact}\n\nCriteria: {spec.acceptance_criteria}",
        text_format=Feedback,
    )
    return r.output_parsed


def harness(prompt: str, max_sprints: int = 3) -> str:
    spec = planner(prompt)
    artifact = generator(spec)
    for sprint in range(max_sprints):
        fb = evaluator(artifact, spec)
        # Stop on a pass, or at the cap so the returned artifact is one
        # the evaluator has actually seen.
        if fb.verdict == "pass" or sprint == max_sprints - 1:
            break
        artifact = generator(spec, feedback=fb)
    return artifact
```
The LangGraph version wires the three roles as nodes, reusing the functions and models defined above. The planner runs once; the generator and evaluator loop under a max-sprint cap. The checkpointer saves state after every node, so a human can inspect the run between sprints.
```python
from typing import TypedDict

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import END, START, StateGraph

# Reuses Spec, Feedback, planner, generator, and evaluator from the previous block.
MAX_SPRINTS = 3


class State(TypedDict, total=False):
    prompt: str
    spec: Spec
    artifact: str
    feedback: Feedback  # absent until the first evaluation
    sprint: int


# Each node returns only the keys it changes; LangGraph merges them into the state.
def plan_node(s: State) -> dict:
    return {"spec": planner(s["prompt"]), "sprint": 0}


def gen_node(s: State) -> dict:
    return {"artifact": generator(s["spec"], s.get("feedback")),
            "sprint": s["sprint"] + 1}


def eval_node(s: State) -> dict:
    return {"feedback": evaluator(s["artifact"], s["spec"])}


def cont(s: State) -> str:
    if s["feedback"].verdict == "pass" or s["sprint"] >= MAX_SPRINTS:
        return "done"
    return "generate"


graph = (
    StateGraph(State)
    .add_node("plan", plan_node)
    .add_node("generate", gen_node)
    .add_node("evaluate", eval_node)
    .add_edge(START, "plan")
    .add_edge("plan", "generate")
    .add_edge("generate", "evaluate")
    .add_conditional_edges("evaluate", cont, {"generate": "generate", "done": END})
    .compile(checkpointer=InMemorySaver())
)

# A checkpointer requires a thread id on every invocation, for example:
# graph.invoke({"prompt": "..."}, config={"configurable": {"thread_id": "run-1"}})
```
Full runnable versions will live at github.com/subodhjena/agentic-patterns under examples/26_harness_design.py as that lesson lands.
The stress-test principle
Anthropic's harness writeup includes a design principle worth quoting: "Every component in a harness encodes an assumption about what the model cannot do on its own, and those assumptions are worth stress testing as models improve" (Anthropic, 2024).
The implication is direct. The planner exists because current models cannot reliably self-scope a task; when they can, the planner can be removed. The evaluator exists because current models cannot reliably self-evaluate; when they can, the evaluator can be merged into the generator. The sprint contract exists because agents drift without explicit commitments; when they stop drifting, the contract can be implicit.
None of these conditions holds today, but all of them are testable. A team that builds a harness and never re-measures its components against newer models may end up carrying architecture the current model no longer needs. The discipline is to check on a fixed schedule (annually, or after each major model upgrade) whether each harness component still earns its cost.
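One way to make that check concrete is an ablation: run the same task set with and without a component, then compare pass rate and cost. The sketch below assumes results have already been collected. `RunResult`, `AblationReport`, `ablate`, and the 5-point `min_gain` threshold are hypothetical names and values chosen for illustration, not part of Anthropic's writeup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    task_id: str
    passed: bool
    cost_usd: float


@dataclass(frozen=True)
class AblationReport:
    pass_rate_gain: float   # pass rate with the component minus without it
    extra_cost_usd: float   # mean per-task cost with the component minus without it
    keep: bool


def _pass_rate(results: list[RunResult]) -> float:
    return sum(r.passed for r in results) / len(results)


def _mean_cost(results: list[RunResult]) -> float:
    return sum(r.cost_usd for r in results) / len(results)


def ablate(with_component: list[RunResult],
           without_component: list[RunResult],
           min_gain: float = 0.05) -> AblationReport:
    """Keep the component only if it raises the pass rate by at least min_gain."""
    if not with_component:
        raise ValueError("no results to compare")
    if {r.task_id for r in with_component} != {r.task_id for r in without_component}:
        raise ValueError("both runs must cover the same tasks")
    gain = _pass_rate(with_component) - _pass_rate(without_component)
    extra = _mean_cost(with_component) - _mean_cost(without_component)
    return AblationReport(gain, extra, keep=gain >= min_gain)


# Illustrative numbers only: four tasks, run with and without the evaluator.
with_eval = [RunResult("t1", True, 0.5), RunResult("t2", True, 0.5),
             RunResult("t3", True, 0.5), RunResult("t4", False, 0.5)]
without_eval = [RunResult("t1", True, 0.25), RunResult("t2", True, 0.25),
                RunResult("t3", False, 0.25), RunResult("t4", False, 0.25)]

report = ablate(with_eval, without_eval)
```

The threshold is a team decision rather than a constant. What the sketch fixes is the procedure: same tasks, one component removed, and a keep-or-drop verdict that can be recomputed after each model upgrade.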
Where the harness breaks down
The failures are specific to the three-role structure.
Planner that over-specifies. A specification that names every tool call and every parameter removes the generator's latitude to solve the problem. The symptom is a generator that produces a literal transcription of the specification rather than a working artifact. Tune the planner toward scope and criteria, not implementation details.
Evaluator that approves the first artifact. An evaluator that lacks skepticism terminates the loop on a mediocre artifact. The same failure mode appears in evaluator-optimizer (Workflows stage) and is fixed the same way: tune the evaluator prompt toward finding issues; require concrete evidence for passing verdicts.
Sprint contract that is not enforced. A contract that neither agent reads nor commits to is only paper. Inter-agent communication via files makes the contract visible; agents that are prompted to re-read the contract before each action stay aligned.
Indefinite sprint loops. Without a max-sprint cap, the generator and evaluator can ping-pong on minor issues forever. Cap sprints; return the best artifact when the cap is reached.
File sprawl. The file-based communication pattern produces many files over a long run. Without a naming convention and cleanup, the directory becomes unusable. Adopt conventions from day one.
Mixing roles. The generator that starts evaluating its own work becomes an evaluator-optimizer, which is a different pattern. The evaluator that rewrites the artifact becomes a second generator. Keep roles strict; when they blur, the GAN-like dynamic that makes the harness work collapses.
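The sprint cap from the list above can be sketched as a loop that never returns an unevaluated artifact and, at the cap, falls back to the best attempt seen. `run_sprints` and `Review` are hypothetical names, and "fewest open issues" is an assumed proxy for "best"; a real harness might rank attempts by contract criteria instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    passed: bool
    issues: list[str]


def run_sprints(generate: Callable[[list[str]], str],
                evaluate: Callable[[str], Review],
                max_sprints: int = 3) -> tuple[str, Review]:
    """Loop generate -> evaluate; stop on a pass or when the cap is reached."""
    if max_sprints < 1:
        raise ValueError("max_sprints must be at least 1")
    best: tuple[str, Review] | None = None
    issues: list[str] = []
    for _ in range(max_sprints):
        artifact = generate(issues)   # the generator sees the previous issues
        review = evaluate(artifact)   # the evaluator only reports; it never edits
        if review.passed:
            return artifact, review
        if best is None or len(review.issues) < len(best[1].issues):
            best = (artifact, review)
        issues = review.issues
    assert best is not None           # guaranteed because max_sprints >= 1
    return best


# Stub agents for illustration: the second draft is the least bad of three.
drafts = iter(["v1", "v2", "v3"])
issue_table = {"v1": ["a", "b", "c"], "v2": ["a"], "v3": ["a", "b"]}

artifact, review = run_sprints(
    generate=lambda issues: next(drafts),
    evaluate=lambda art: Review(False, issue_table[art]),
    max_sprints=3,
)
```

Keeping `generate` and `evaluate` as separate callables also keeps the roles strict: the loop passes issues one way and artifacts the other, and neither side can do the other's job.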
Trade against a single ReAct agent
A harness is more complex than a plain ReAct agent. The table names when the complexity pays off.
| Axis | Single ReAct agent | Planner-Generator-Evaluator harness |
|---|---|---|
| Components | One | Three, plus contracts |
| Context per turn | Full conversation | Scoped per role |
| Durability | Message log | Files on disk |
| Iteration | Inside the loop | Across sprints |
| Best-fit tasks | Short, tool-heavy | Long-running, artifact-producing |
| Cost overhead | Minimal | Planner call plus evaluator calls |
| Failure diagnosis | Trace reading | File inspection |
Reach for the harness when the task is long-lived, the artifact is durable, and the team wants a clear separation between planning, execution, and review. For short tasks, a plain ReAct agent is cheaper and simpler.
Neighbors in the series
Evaluator-optimizer, in the Workflows stage, is the two-role predecessor: no planner, just generator and evaluator. Plan-and-execute, in the Agents stage, is a lighter version of this harness for shorter tasks. Persistence and checkpointing, in the Memory stage, is the infrastructure that makes long-running harnesses durable. Agent evaluation, the previous article, provides the calibration tools for measuring whether the harness is improving.
References
- Anthropic. Building effective agents. December 2024.
- Anthropic. A practical guide to building agents. March 2025.
- Goodfellow, Ian, et al. Generative Adversarial Networks. NeurIPS 2014.
- LangChain. Multi-agent harnesses in LangGraph. 2024.
- Madaan, Aman, et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
