Evaluator-Optimizer: Iterative Refinement with a Separate Critic LLM

A first-draft response from a language model is often serviceable and rarely excellent. Code compiles but fails tests. Translations convey meaning but miss nuance. Marketing copy hits the brief but reads flat. The quality gap is usually visible to a competent reader; it is less reliably visible to the model that produced the draft. Closing the gap in a single prompt is possible but hard. Closing it with a second pass is usually easier, provided the second pass is performed by a different LLM call with its own mandate.

The evaluator-optimizer pattern formalizes that second pass. A generator LLM produces a draft. A separate evaluator LLM scores it against explicit criteria and returns feedback. If the score clears a threshold, the draft is the final output. Otherwise the feedback is folded back into the generator's next prompt, and the loop repeats. Anthropic lists this as the fifth workflow pattern in its agentic design guide and reports a specific failure mode that makes the two-LLM structure necessary: generators asked to evaluate their own work "tend to respond by confidently praising the work, even when, to a human observer, the quality is obviously mediocre" (Anthropic, 2024).

Why self-evaluation fails

A generator primed to produce a result is not calibrated to find flaws in that result. The same mechanism that improves fluency of output tends to smooth over defects rather than flag them. A separate evaluator LLM, given a skeptical prompt and no ownership of the draft, can afford to be blunt. Anthropic notes that "tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work." The pattern's core move is this split: two prompts, two roles, two calls.

The evaluator's job is not to produce a better draft. It is to produce a verdict and a diagnosis. The generator's job is to use that diagnosis on the next pass. Keeping the roles clean is what makes the loop converge rather than oscillate.

The refinement loop

flowchart LR
    IN([Input + criteria]) --> G[Generator LLM]
    G --> D[Draft]
    D --> E[Evaluator LLM]
    E -->|score below threshold| FB[Structured feedback]
    FB --> G
    E -->|score at or above threshold| OUT([Final output])
    E -->|max iterations reached| STOP([Return best draft])

The loop has a budget. Without a maximum iteration count, the pattern can run indefinitely on inputs the evaluator will never approve. With one, the worst case is bounded: when the budget runs out, return the best draft seen so far. A three-iteration cap is a reasonable default; in practice most of the gain tends to arrive in the first refinement.

The feedback itself is structured. A free-text critique is harder for the generator to act on than a typed object with explicit strengths, weaknesses, and a verdict. Structured output, covered earlier in this series, is the contract that makes this reliable.
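As a rough illustration of why a typed critique is easier to act on, the sketch below renders one into the generator's next prompt. The `Feedback` dataclass and `render_feedback` helper are hypothetical names invented for this example; only the strengths, weaknesses, and verdict shape comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Hypothetical typed critique; the exact shape is an assumption."""
    verdict: str  # "pass" or "needs_revision"
    strengths: list[str] = field(default_factory=list)
    weaknesses: list[str] = field(default_factory=list)


def render_feedback(task: str, fb: Feedback) -> str:
    """Build the generator's next prompt from the task and a typed critique."""
    if fb.verdict == "pass" or not fb.weaknesses:
        return task  # nothing to fix, so the prompt is just the task
    keep = "\n".join(f"- {s}" for s in fb.strengths) or "- (none listed)"
    fix = "\n".join(f"- {w}" for w in fb.weaknesses)
    return f"{task}\n\nKeep:\n{keep}\n\nFix:\n{fix}"


fb = Feedback("needs_revision", ["clear intro"], ["no example", "too long"])
next_prompt = render_feedback("Write a summary.", fb)
```

Because each weakness arrives as its own list item, the generator sees a checklist rather than a paragraph it has to interpret.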

Two versions in code

The excerpt below shows the pattern without a framework. The evaluator returns a typed Evaluation with a score and a verdict; the loop exits when the verdict is pass or the iteration budget runs out.

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()

class Evaluation(BaseModel):
    score: float = Field(ge=0.0, le=1.0)
    strengths: list[str]
    weaknesses: list[str]
    verdict: Literal["pass", "needs_revision", "fail"]

def generate(task: str, feedback: str = "") -> str:
    prompt = task if not feedback else f"{task}\n\nPrevious feedback:\n{feedback}"
    return client.responses.create(
        model="gpt-4o-mini",
        instructions="Produce the result. Address feedback if provided.",
        input=prompt,
    ).output_text

def evaluate(draft: str, criteria: str) -> Evaluation:
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions=f"Evaluate strictly against: {criteria}. "
                     "Score >= 0.8 means genuinely excellent.",
        input=draft, text_format=Evaluation,
    )
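    # output_parsed is the SDK's shortcut to the parsed Evaluation; it does not
    # depend on the message being the first item in r.output.
    return r.output_parsed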

def refine(task: str, criteria: str, max_rounds: int = 3) -> str:
    draft = generate(task)
    best, best_score = draft, -1.0  # sentinel: any real score replaces it
    for round_no in range(max_rounds):
        e = evaluate(draft, criteria)
        if e.score > best_score:
            best, best_score = draft, e.score
        if e.verdict == "pass":
            return draft
        if round_no == max_rounds - 1:
            break  # budget spent; skip a generation that would never be evaluated
        feedback = "\n".join(f"- {w}" for w in e.weaknesses)
        draft = generate(task, feedback)
    return best
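The loop's control flow can be checked without calling a model. The sketch below is a variant that takes the generator and evaluator as plain callables. `run_loop`, the stub functions, and the canned scores are all made up for illustration, and the `(score, weaknesses)` tuple is a simplified stand-in for the Evaluation object.

```python
from typing import Callable

# Simplified stand-in for Evaluation: (score, weaknesses).
Scored = tuple[float, list[str]]


def run_loop(
    generate: Callable[[str], str],
    evaluate: Callable[[str], Scored],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> tuple[str, float, int]:
    """Return (best draft, its score, number of evaluations used)."""
    draft = generate("")
    best, best_score = draft, -1.0
    for round_no in range(1, max_rounds + 1):
        score, weaknesses = evaluate(draft)
        if score > best_score:
            best, best_score = draft, score
        if score >= threshold:
            return draft, score, round_no
        if round_no < max_rounds:  # no point generating a draft nobody scores
            draft = generate("\n".join(weaknesses))
    return best, best_score, max_rounds


# Stubs: the "model" marks revisions; scores come from a canned list.
_canned_scores = iter([0.55, 0.70, 0.65])


def stub_generate(feedback: str) -> str:
    return "v1" if not feedback else f"revised({feedback})"


def stub_evaluate(draft: str) -> Scored:
    return next(_canned_scores), ["too vague"]


result = run_loop(stub_generate, stub_evaluate)
```

With these canned scores the threshold is never cleared, so the function falls back to the best-scoring draft (the second one, at 0.70) rather than the last.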

The LangGraph version expresses the loop as a state graph with a conditional edge that routes back to the generator when the verdict requires revision. The best_score pattern lets the graph return the best draft seen even if the threshold is never cleared.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain.chat_models import init_chat_model

model = init_chat_model("gpt-4o-mini")

class State(TypedDict):
    task: str
    criteria: str
    draft: str
    evaluation: Evaluation  # the Pydantic model defined in the previous excerpt
    rounds: int
    best: str
    best_score: float

def gen_node(s: State) -> State:
    prior = "" if s["rounds"] == 0 else "\n".join(f"- {w}" for w in s["evaluation"].weaknesses)
    prompt = s["task"] if not prior else f"{s['task']}\n\nFeedback:\n{prior}"
    return {**s, "draft": model.invoke(prompt).content, "rounds": s["rounds"] + 1}

def eval_node(s: State) -> State:
    e = model.with_structured_output(Evaluation).invoke(
        f"Evaluate against {s['criteria']}. >=0.8 means excellent.\n\n{s['draft']}")
    best, best_score = (s["draft"], e.score) if e.score > s["best_score"] \
                       else (s["best"], s["best_score"])
    return {**s, "evaluation": e, "best": best, "best_score": best_score}

def should_continue(s: State) -> str:
    if s["evaluation"].verdict == "pass": return "done"
    if s["rounds"] >= 3: return "done"
    return "refine"

graph = (StateGraph(State)
         .add_node("generate", gen_node).add_node("evaluate", eval_node)
         .add_edge(START, "generate").add_edge("generate", "evaluate")
         .add_conditional_edges("evaluate", should_continue,
                                {"refine": "generate", "done": END})
         .compile())

Full runnable versions are at github.com/subodhjena/agentic-patterns under examples/10_evaluator_optimizer.py and examples/10_evaluator_optimizer_langgraph.py.

When the loop actually improves things

Evaluator-optimizer has a narrow sweet spot. The pattern helps when three conditions hold at once.

  • The evaluation criteria are clear and measurable. "Does this code pass its tests?" is measurable. "Is this translation nuanced?" is also measurable if the evaluator is given examples of nuance. "Is this response good?" is not measurable and will produce drift.
  • Iterative refinement demonstrably improves quality. Some tasks improve monotonically with feedback; others oscillate. If the second iteration is worse than the first on representative inputs, the loop is hurting. Measure before enabling.
  • The evaluator has a sharper prompt than a one-shot quality check would allow. If a single prompt that asks the generator to "produce a careful, self-reviewed answer" matches the evaluator-optimizer output, the pattern is overhead.
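The "measure before enabling" check in the second condition can be as small as comparing first-pass and refined scores over a sample of representative inputs. The following is a minimal sketch; `refinement_report` is a hypothetical helper and the scores are invented.

```python
from statistics import mean


def refinement_report(pairs: list[tuple[float, float]]) -> dict[str, float]:
    """Summarize (first_pass_score, refined_score) pairs from an offline run."""
    deltas = [after - before for before, after in pairs]
    return {
        "mean_gain": mean(deltas),
        # Share of inputs where refinement made the result worse.
        "regressed_share": sum(d < 0 for d in deltas) / len(deltas),
    }


# Made-up scores for four representative inputs.
report = refinement_report([(0.60, 0.80), (0.70, 0.70), (0.75, 0.65), (0.50, 0.70)])
```

A positive mean gain alongside a non-trivial regressed share suggests the loop helps on average but still needs the best-draft fallback.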

The canonical successful applications are code generation paired with test execution, literary translation paired with a skeptical evaluator, and structured document authoring paired with a schema-aware checker. In each, the evaluator has a concrete signal.
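For the code-generation case, the concrete signal can come from running tests instead of asking a model. The sketch below scores a draft by the share of test cases it passes and turns each failure into a weakness. `evaluate_with_tests` is a hypothetical helper, and running generated code through `exec` is an assumption made for brevity; a real system would need a sandbox.

```python
def evaluate_with_tests(
    source: str, func_name: str, cases: list[tuple[tuple, object]]
) -> tuple[float, list[str]]:
    """Score generated code by the share of (args, expected) cases it passes.

    WARNING: exec() runs arbitrary code; sandbox it outside of a demo.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception as err:  # syntax error, missing function, and so on
        return 0.0, [f"code did not load: {err!r}"]

    failures = []
    for args, expected in cases:
        try:
            got = func(*args)
            if got != expected:
                failures.append(
                    f"{func_name}{args} returned {got!r}, expected {expected!r}"
                )
        except Exception as err:
            failures.append(f"{func_name}{args} raised {err!r}")
    return 1 - len(failures) / len(cases), failures


# A deliberately buggy draft: subtraction instead of addition.
draft = "def add(a, b):\n    return a - b\n"
score, weaknesses = evaluate_with_tests(draft, "add", [((2, 0), 2), ((1, 1), 2)])
```

The failure messages double as feedback: they name the input, the observed value, and the expected value, which is more actionable than a free-text critique.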

Where the loop goes wrong

The failure modes cluster around the evaluator's calibration, the loop's termination, and the feedback's shape.

Generous evaluator. If the evaluator accepts mediocre drafts, the loop terminates early and the output is shallow. Tune the evaluator prompt to be skeptical. A useful trick is to instruct the evaluator that a score of 0.8 or above requires genuine excellence and that most drafts should fall between 0.5 and 0.75.

Oscillation. Successive drafts flip between two flaws rather than converging. The symptom is near-identical scores on consecutive rounds. Limit iterations hard; consider returning the best draft seen rather than the last.
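The near-identical-scores symptom is cheap to detect from the score history alone. Here is a minimal sketch; `has_stalled` is a hypothetical helper, and the tolerance and window defaults are arbitrary assumptions, not recommendations.

```python
def has_stalled(scores: list[float], tolerance: float = 0.02, window: int = 3) -> bool:
    """True when the last `window` scores all lie within `tolerance` of each other."""
    if len(scores) < window:
        return False  # not enough history to judge
    recent = scores[-window:]
    return max(recent) - min(recent) <= tolerance


stalled = has_stalled([0.50, 0.71, 0.70, 0.71])
still_improving = has_stalled([0.50, 0.60, 0.70])
```

A loop could stop early and return the best draft when this check fires, rather than spending its remaining budget.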

Feedback as paraphrase. The evaluator returns a rewritten draft instead of a critique, and the generator now has two drafts competing for attention. Constrain the evaluator's output schema to strengths, weaknesses, and a verdict, and forbid draft content in any of those fields.
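A schema alone cannot stop an evaluator from pasting a rewrite into a weaknesses entry, so a cheap similarity guard can help. The sketch below uses the standard library's `difflib.SequenceMatcher`; `looks_like_rewrite` is a hypothetical helper and the 0.6 threshold is an untuned assumption.

```python
from difflib import SequenceMatcher


def looks_like_rewrite(draft: str, feedback_item: str, threshold: float = 0.6) -> bool:
    """Flag a feedback item whose text is suspiciously similar to the draft."""
    ratio = SequenceMatcher(None, draft.lower(), feedback_item.lower()).ratio()
    return ratio >= threshold


draft = "Our new kettle boils water fast and looks great on any counter."
rewrite = "Our new kettle boils water quickly and looks stunning on any counter."
critique = "No price is mentioned."

flag_rewrite = looks_like_rewrite(draft, rewrite)
flag_critique = looks_like_rewrite(draft, critique)
```

Flagged items can be dropped before the feedback reaches the generator, or the evaluation can be retried.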

Uncapped iteration. Without a max-rounds budget, the loop can run indefinitely on inputs the evaluator will never approve. Always cap.

Evaluator is the same model as the generator. Using the identical model with a different prompt is cheaper than using a stronger one, but drops effectiveness on hard tasks. A two-tier setup (cheap generator, stronger evaluator) often dominates both single-tier options on cost-adjusted quality.

Trade against a single well-tuned prompt

Evaluator-optimizer is a real cost increase. Moving from a single generation that scores 0.75 to three iterations that finish at 0.82 makes each result three to four times more expensive. The trade is worth it when the quality gain is load-bearing for the use case, and not otherwise.

| Axis | Single careful prompt | Evaluator-optimizer |
| --- | --- | --- |
| Cost per result | One generation | Up to N generations plus N evaluations |
| Latency | Low, single round-trip | Higher, proportional to iteration count |
| Quality ceiling | Limited by the single prompt | Higher when criteria are measurable |
| Calibration burden | One prompt to tune | Two prompts, one critical |
| Failure modes | Silent under-quality | Oscillation, generous evaluator, runaway loop |
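The cost row can be turned into back-of-the-envelope arithmetic. The sketch below assumes every generation costs the same and expresses an evaluation as a fraction of that cost; `cost_multiplier` and the relative costs are illustrative assumptions, not measured prices.

```python
def cost_multiplier(rounds: int, gen_cost: float = 1.0, eval_cost: float = 1.0) -> float:
    """Cost of `rounds` generate-evaluate iterations relative to one generation."""
    return rounds * (gen_cost + eval_cost) / gen_cost


# Evaluations priced like generations: six calls' worth of cost.
same_price = cost_multiplier(3)

# Evaluations assumed to cost a quarter of a generation (short structured output).
cheap_eval = cost_multiplier(3, eval_cost=0.25)
```

Under the second assumption a three-round loop lands near the "three to four times" range; a stronger, pricier evaluator pushes the multiplier higher.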

Neighbors in the series

Prompt chaining is the non-iterative predecessor: steps in sequence, no loop. Evaluator-optimizer is what chaining becomes when a step can veto and restart. Orchestrator-workers and evaluator-optimizer often compose: workers produce subtask drafts; an evaluator critiques each before synthesis. Reflexion, covered in the Reasoning stage, is the agent-side cousin of this pattern: rather than a separate evaluator, the agent critiques its own prior trajectory and retries. LLM-as-judge evaluation, covered in the Safety stage, uses the same two-LLM split for grading rather than refinement.

References

  1. Anthropic. Building effective agents. December 2024.
  2. Madaan, Aman, et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
  3. Shinn, Noah, et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
  4. LangChain. Evaluator-optimizer patterns in LangGraph. 2024.
  5. OpenAI. Evaluation and iteration in agent workflows. 2024.