
Agent Evaluation: LLM-as-Judge, Pass-at-K, and Benchmarks

May 6, 2026, 8 min read

The most consistent piece of advice in Anthropic's agent engineering writeups is to build the evaluation harness before the agent. Teams that skip this step cannot tell an improvement from a regression. Teams that do it right can reliably answer the question "is today's version better than yesterday's?" and make the decisions that depend on that answer.

This article covers the three layers of agent evaluation: the twenty-query starting harness with an LLM-as-judge, the pass-at-k family of metrics that captures variance, and the standard benchmarks that let teams compare against the field. Anthropic's guidance, OpenAI's evaluation documentation, and the academic benchmarks (SWE-bench, GAIA, AgentBench, WebArena, ToolBench) all converge on the same framing: no single layer catches everything, so combine them (Anthropic, 2024; OpenAI, 2024).

Start with twenty queries

Anthropic's guidance is specific: start with roughly twenty queries, not two hundred. Early-stage development shows dramatic effect sizes (30 to 80 percent improvements from single prompt tweaks), and twenty queries are enough to measure those effect sizes with confidence. A two-hundred-query set is expensive to curate, expensive to run, and slow to iterate on, all before the agent is even working.

The twenty queries should cover the golden path and the edge cases the team already knows about. They are not a benchmark; they are a calibration set. When the effect sizes shrink (the agent is doing well enough that small changes produce small differences), the harness expands. Two hundred queries come later, not first.
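A rough way to see why twenty queries are enough early on is to estimate the noise in a pass rate measured on n queries. The sketch below uses the normal approximation to a binomial proportion, which is itself crude at n = 20, and assumes each query is an independent pass/fail trial. The function name is illustrative, not from any library.

```python
import math

def pass_rate_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate p measured on n queries.

    Uses the normal approximation z * sqrt(p * (1 - p) / n); treat the result
    as a rough guide, especially for small n.
    """
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1.0 - p) / n)

# Worst case is p = 0.5, where the variance of a pass/fail outcome is largest.
for n in (20, 200):
    print(n, round(pass_rate_margin(0.5, n), 3))
```

Under these assumptions the margin is about 22 points at twenty queries and about 7 points at two hundred. A change that moves the pass rate by 30 points or more stands out against the first figure; a 5-point change needs the larger set.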

LLM-as-judge

flowchart LR
    AO[Agent output] --> J[LLM judge]
    GT[Ground truth or criteria] --> J
    J --> SCORE[0.0 to 1.0 score]
    J --> REASON[Explanation]

A judge is a separate LLM call with a scoring prompt. It reads the agent's output, compares it against the ground truth or criteria, and emits a score between zero and one plus a short explanation. Anthropic reports that LLM-as-judge "proved most consistent and aligned with human judgements" in their experience (Anthropic, 2024).

The judge is not perfect; it inherits its own model's biases and miscalibration. Two mitigations matter. First, calibrate the judge on a held-out set that humans have graded, and measure judge-human agreement before trusting it. Second, use the judge as one layer among several. Production-grade evaluation follows the Swiss Cheese Model: several imperfect layers whose failure modes are uncorrelated.

Automated evaluations run on every commit, using the judge and any deterministic checks the team has in place. They are fast and unlimited; they measure the agent against the twenty-query set or its expansion.

Production monitoring captures ground truth from real usage: actual outcomes, user ratings, downstream conversions. Production data catches what automated evaluations do not, because real queries differ from curated ones.

Periodic human review calibrates the automated layer. A handful of traces graded by humans every week surfaces whether the judge is drifting. When the judge and humans disagree, the judge prompt needs revision.
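Judge-human agreement can be measured with a few lines of code. The sketch below assumes the judge and a human have each labeled the same traces with one of the verdict labels; it reports raw agreement and Cohen's kappa, which corrects for the agreement expected by chance. The function name and the sample labels are made up for illustration.

```python
from collections import Counter

def agreement_and_kappa(judge_labels: list[str], human_labels: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Fraction of traces where judge and human gave the same label
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label
    return observed, (observed - expected) / (1.0 - expected)

# Hypothetical weekly sample of six traces
judge = ["correct", "correct", "incorrect", "partial", "correct", "incorrect"]
human = ["correct", "partial", "incorrect", "partial", "correct", "correct"]
agreement, kappa = agreement_and_kappa(judge, human)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Tracking these two numbers week over week gives a simple signal: if they fall, the judge prompt or the criteria probably need another look. What counts as acceptable agreement depends on the task and is a team decision.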

Pass-at-k and pass-to-the-k

Single-run accuracy is misleading for agents because agent runs are stochastic. Two metrics from the research literature handle this directly.

Pass-at-k is the probability that at least one of k runs produces a correct answer. It rewards systems that can solve a problem eventually, even if they are unreliable on any individual attempt. It is the right metric when the user can retry, or when the agent has an internal verification step that picks the correct answer from several attempts.

Pass-to-the-k (written pass^k) is the probability that all k runs produce a correct answer. It is a stricter metric that measures robustness. It is the right metric when the user sees exactly one output and that output must be correct.

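The two metrics measure different things. A system with 80 percent pass-at-1 and 40 percent pass^10 is usually right on a single attempt but not dependable across repeated runs; a system with 60 percent pass-at-1 and 90 percent pass-at-5 is unreliable on any one attempt but recovers with retries. For a fixed system, pass-at-k can only rise as k grows and pass^k can only fall. Reporting both, or a variant appropriate to the product, is standard in benchmark papers.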
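When a task has been run n times with c correct results, both metrics can be estimated for any k up to n without rerunning anything. The sketch below uses the combinatorial estimators common in the benchmark literature: the chance that a random subset of k runs contains at least one correct run, and the chance that it contains only correct runs. The function names are illustrative, and the estimates are per task; average them across tasks for a suite-level number.

```python
from math import comb

def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k runs is correct) from n runs with c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts k-subsets made entirely of failures (0 if k > n - c)
    return 1.0 - comb(n - c, k) / comb(n, k)

def estimate_pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k runs are correct) from n runs with c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts k-subsets made entirely of successes (0 if k > c)
    return comb(c, k) / comb(n, k)

# One task, 10 runs, 8 correct
for k in (1, 5):
    print(k, round(estimate_pass_at_k(10, 8, k), 3), round(estimate_pass_pow_k(10, 8, k), 3))
```

For this task both metrics equal 0.8 at k = 1. At k = 5, pass-at-k reaches 1.0 while pass^k drops to about 0.22, which shows how far the two can diverge on the same data.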

Beyond accuracy, track transcript statistics: number of turns, number of tool calls, token usage, wall-clock latency, and task-specific metrics (did the code compile, did the deployment succeed, did the user accept the response).
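One way to keep these statistics next to the accuracy numbers is a small per-run record plus a summary function. The field names below are assumptions chosen for illustration; a real harness would fill them from whatever its tracing layer records.

```python
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class RunStats:
    turns: int          # conversation turns in the transcript
    tool_calls: int     # number of tool invocations
    tokens: int         # total tokens consumed
    latency_s: float    # wall-clock seconds
    succeeded: bool     # task-specific success signal (e.g. the code compiled)

def summarize(runs: list[RunStats]) -> dict:
    """Aggregate transcript statistics across runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "success_rate": mean([1.0 if r.succeeded else 0.0 for r in runs]),
        "mean_turns": mean([r.turns for r in runs]),
        "mean_tool_calls": mean([r.tool_calls for r in runs]),
        "mean_tokens": mean([r.tokens for r in runs]),
        "median_latency_s": median([r.latency_s for r in runs]),
    }

# Two hypothetical runs of the same query
runs = [RunStats(3, 2, 1200, 4.0, True), RunStats(5, 4, 2000, 6.0, False)]
print(summarize(runs))
```

Median latency is used here because a single slow run can dominate a mean; that choice is a judgment call, not a rule.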

Benchmarks worth knowing

Several benchmarks have become reference points in the agentic evaluation literature. Each tests a different capability.

Benchmark	What it tests	Reference result
SWE-bench	Resolving real GitHub issues with code patches	Top systems: 30 to 50 percent on the Lite split
GAIA	Multi-step reasoning plus tool use across 466 questions	Humans: 92 percent; GPT-4 plus plugins: 15 percent overall
AgentBench	Eight environments (OS, DB, web, games)	Commercial models substantially outperform open-source models
WebArena	Self-hosted websites (e-commerce, GitLab)	Top agents: approximately 35 percent
ToolBench	Over 16,000 real-world APIs across 49 categories	Tests API selection and tool chaining

Benchmarks are useful for comparing approaches at a point in time; they are less useful for driving day-to-day improvement. A team building a customer service agent does not optimize for SWE-bench. Use benchmarks to anchor the team's sense of where the field is; use the internal twenty-query set (and its expansion) to drive iteration.

Two versions in code

The excerpt below is an LLM-as-judge with a pass-at-k harness, without a framework. The judge returns a typed verdict; the harness runs the agent k times and aggregates.

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()

class Verdict(BaseModel):
    score: float = Field(ge=0.0, le=1.0)
    verdict: Literal["correct", "partial", "incorrect"]
    reasoning: str

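def judge(query: str, answer: str, criteria: str) -> Verdict:
    """Score one agent answer against the criteria with a separate LLM call."""
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions=("Compare the answer against the criteria. Score 0.0-1.0. "
                      "Return 'correct' only if the answer fully matches the criteria."),
        input=f"Query: {query}\nAnswer: {answer}\nCriteria: {criteria}",
        text_format=Verdict)
    # output_parsed is the SDK shortcut to the parsed Verdict; it is None when
    # the model refused or produced nothing parseable.
    if r.output_parsed is None:
        raise ValueError("judge returned no parsed verdict")
    return r.output_parsed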

def pass_at_k(agent, queries: list[dict], k: int = 5) -> dict:
    """Run the agent k times per query and aggregate pass@k, pass^k, and mean score."""
    results = []
    for q in queries:
        # k independent attempts at the same query, each judged separately
        attempts = [agent(q["query"]) for _ in range(k)]
        scored = [judge(q["query"], a, q["criteria"]) for a in attempts]
        any_correct = any(v.verdict == "correct" for v in scored)  # pass@k
        all_correct = all(v.verdict == "correct" for v in scored)  # pass^k
        mean_score = sum(v.score for v in scored) / len(scored)
        results.append({"query": q["query"], "pass@k": any_correct,
                        "pass^k": all_correct, "mean_score": mean_score})
    # Fraction of queries that pass under each definition (assumes queries is non-empty)
    return {
        "pass@k": sum(r["pass@k"] for r in results) / len(results),
        "pass^k": sum(r["pass^k"] for r in results) / len(results),
        "mean_score": sum(r["mean_score"] for r in results) / len(results),
        "details": results,
    }

The LangChain version uses LangSmith for tracing and evaluation. The evaluate helper runs a dataset against an agent and applies both custom and LLM-based evaluators.

from langsmith import Client, evaluate
from langsmith.evaluation import LangChainStringEvaluator
from langchain.chat_models import init_chat_model

client = Client()
dataset_name = "support-queries-v1"

qa_evaluator = LangChainStringEvaluator(
    "qa", config={"llm": init_chat_model("gpt-4o-mini")})

def target(inputs: dict) -> dict:
    # `agent` is assumed to be defined elsewhere (for example, a LangGraph agent)
    result = agent.invoke({"messages": [("user", inputs["query"])]})
    return {"answer": result["messages"][-1].content}

results = evaluate(target, data=dataset_name,
                   evaluators=[qa_evaluator],
                   experiment_prefix="agent-v2-",
                   num_repetitions=5)  # for pass-at-k-style variance

Full runnable versions will live at github.com/subodhjena/agentic-patterns under examples/25_evaluation.py as that lesson lands.

Where evaluation goes wrong

The anti-patterns below appear across teams and are specifically called out in Anthropic's evaluation guidance.

Brittle grading. Exact step-sequence validation penalizes valid alternatives. If the agent produces a correct answer through a different path than the reference solution, strict grading marks it wrong. Use outcome-based grading when possible; grade the destination, not the route.
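The difference between grading the route and grading the destination is easy to see in code. The sketch below is hypothetical throughout: the tool names, the refund scenario, and the state fields are invented for illustration.

```python
def grade_by_steps(trace: list[str], reference_steps: list[str]) -> bool:
    """Brittle: passes only if the agent took exactly the reference path."""
    return trace == reference_steps

def grade_by_outcome(final_state: dict, expected_state: dict) -> bool:
    """Outcome-based: passes if every expected field has the expected value."""
    return all(final_state.get(key) == value for key, value in expected_state.items())

# Hypothetical refund task: the agent checked the policy before looking up the order
reference_steps = ["lookup_order", "check_policy", "issue_refund"]
agent_trace = ["check_policy", "lookup_order", "issue_refund"]
final_state = {"order_status": "refunded", "refund_amount": 42.0, "notes": "ok"}
expected_state = {"order_status": "refunded", "refund_amount": 42.0}

print(grade_by_steps(agent_trace, reference_steps))    # route differs from the reference
print(grade_by_outcome(final_state, expected_state))   # destination matches
```

The step grader rejects a run that reached the right end state by a different order of calls; the outcome grader accepts it. Some tasks do require a particular step (a policy check before a refund, say), in which case that step can be asserted on its own rather than by matching the whole sequence.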

Ambiguous specifications. An agent cannot succeed at a task if success is not well-defined. When the criteria are fuzzy, human-judge agreement is poor and automated judges are useless. Write concrete criteria first; only then measure against them.

Not reading transcripts. Summary metrics hide the reasons the agent fails. A team that only looks at aggregate pass rate misses patterns in the failures that transcript inspection reveals. Schedule time to read failed traces on every release.

One-sided evaluations. Testing only the happy path produces an agent that works on the cases the team thought of and fails on the cases they did not. Curate the evaluation set to include ambiguous inputs, adversarial inputs, and edge cases the agent must handle.

Over-strict matching. Rigid grading penalizes "96.12" when the expected answer is "96.124991...". A judge tuned to the task's tolerance avoids this, but automated string-match grading does not.
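For numeric answers, a deterministic tolerance check is a cheap alternative to asking a judge. The sketch below swaps the judge for Python's math.isclose; the helper name and the default tolerance are assumptions, and the right tolerance depends on the task.

```python
import math

def numeric_match(answer: str, expected: str, rel_tol: float = 1e-3) -> bool:
    """Compare two answers as numbers within a relative tolerance.

    Falls back to an exact string comparison when either value is not numeric.
    """
    try:
        return math.isclose(float(answer), float(expected), rel_tol=rel_tol)
    except ValueError:
        return answer.strip() == expected.strip()

print("96.12" == "96.124991")                 # strict string match rejects the answer
print(numeric_match("96.12", "96.124991"))    # tolerance-aware match accepts it
print(numeric_match("95.0", "96.124991"))     # a genuinely wrong value still fails
```

With a relative tolerance of 0.1 percent, the rounded answer passes and a value more than a point off does not.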

Judge drift. The judge is itself a language model, and its behavior can drift across model versions. When the judge's behavior changes, the measured agent performance appears to change without the agent itself changing. Pin the judge model version in experiments; recalibrate when it changes.
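One way to make pinning enforceable is to store a fingerprint of the judge configuration with every experiment and refuse to compare runs whose fingerprints differ. The sketch below is an assumption about how a team might do this, not a feature of any evaluation library; the model identifiers are examples only.

```python
import hashlib
import json

def judge_fingerprint(model: str, instructions: str) -> str:
    """Short, stable hash of the judge's model identifier and scoring prompt."""
    payload = json.dumps({"model": model, "instructions": instructions}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

def comparable(run_a: dict, run_b: dict) -> bool:
    """Two experiment records are comparable only if the same judge scored both."""
    return run_a["judge_fingerprint"] == run_b["judge_fingerprint"]

prompt = "Compare the answer against the criteria. Score 0.0-1.0."
# A dated snapshot name pins the judge; an undated alias may change underneath you.
baseline = {"name": "agent-v1", "judge_fingerprint": judge_fingerprint("gpt-4o-mini-2024-07-18", prompt)}
candidate = {"name": "agent-v2", "judge_fingerprint": judge_fingerprint("gpt-4o-mini-2024-07-18", prompt)}
rejudged = {"name": "agent-v2b", "judge_fingerprint": judge_fingerprint("gpt-4o-mini", prompt)}

print(comparable(baseline, candidate))  # same judge, scores can be compared
print(comparable(baseline, rejudged))   # judge changed, recalibrate first
```

The fingerprint covers the prompt as well as the model, since editing the scoring instructions shifts scores just as a model upgrade does.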

Benchmark over-fitting. Optimizing heavily against one benchmark produces an agent that performs well on that benchmark and worse in the wild. Use several benchmarks and regular production monitoring.

Trade-offs against skipping evaluation

The cost of not evaluating is larger than it looks on day one.

Axis	No evaluation	Full evaluation stack
Iteration speed	Fast, blind	Slower per change, informed
Regression detection	Reactive, in production	Proactive, before deploy
Cost of evaluation	Zero	Judge tokens, human review hours
Confidence in changes	Low	High
Ability to ship	High in the short term	Sustained over time

Skipping evaluation accelerates early development and slows everything after. Production agents shipped without an evaluation harness stay stuck at their current performance, because every change is a guess.

Neighbors in the series

Guardrails, the previous article, is the safety-focused companion of evaluation; guardrails catch unsafe behavior at runtime, evaluation catches it at development time. Evaluator-optimizer, in the Workflows stage, uses an evaluator inside the agent's inner loop; agent evaluation uses a similar judge in the outer loop. Harness design, the next article, describes the planner-generator-evaluator architecture that uses evaluation as a first-class component. The decision framework article at the end of the series uses evaluation data to justify pattern choices.

References

Anthropic. Building effective agents. December 2024.
OpenAI. Evaluation and iteration in agent workflows. 2024.
Jimenez, Carlos, et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
Mialon, Gregoire, et al. GAIA: A Benchmark for General AI Assistants. 2023.
Zheng, Lianmin, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

