Agent Evaluation: LLM-as-Judge, Pass-at-K, and Benchmarks
The most consistent piece of advice in Anthropic's agent engineering writeups is to build the evaluation harness before the agent. Teams that skip this step ship agents they cannot tell apart from their regressions. Teams that do it right can reliably answer the question "is today's version better than yesterday's?" and make the decisions that depend on that answer.
This article covers the three layers of agent evaluation: the twenty-query starting harness with an LLM-as-judge, the pass-at-k family of metrics that captures variance, and the standard benchmarks that let teams compare against the field. Anthropic's guidance, OpenAI's evaluation documentation, and the academic benchmarks (SWE-bench, GAIA, AgentBench, WebArena, ToolBench) all converge on the same framing: no single layer catches everything, so combine them (Anthropic, 2024; OpenAI, 2024).
Start with twenty queries
Anthropic's guidance is specific: start with roughly twenty queries, not two hundred. Early-stage changes have dramatic effect sizes (a single prompt tweak can move the success rate from 30 to 80 percent), and twenty queries are enough to measure effects of that size with confidence. A two-hundred-query set is expensive to curate, expensive to run, and slow to iterate on before the agent is even working.
The twenty queries should cover the golden path and the edge cases the team already knows about. They are not a benchmark; they are a calibration set. When the effect sizes shrink (the agent is doing well enough that small changes produce small differences), the harness expands. Two hundred queries come later, not first.
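A rough way to see why twenty queries work early and stop working later is a two-proportion z-test. The sketch below is illustrative only: the function name `two_proportion_z` and the example pass rates are assumptions, the normal approximation is coarse at this sample size, and the test treats the two runs as independent samples even though the same queries are reused.

```python
import math


def two_proportion_z(p_before: float, p_after: float, n: int) -> float:
    """z statistic for the difference between two pass rates, each measured on n queries."""
    pooled = (p_before + p_after) / 2  # equal n on both sides, so pooled rate is the mean
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return (p_after - p_before) / se


# Early stage (assumed rates): a large jump stands out with only twenty queries.
z_large = two_proportion_z(0.30, 0.80, n=20)

# Later stage (assumed rates): a five-point gain is lost in the noise at the same size.
z_small = two_proportion_z(0.80, 0.85, n=20)

print(f"large effect, n=20: z = {z_large:.2f}")
print(f"small effect, n=20: z = {z_small:.2f}")
```

An absolute z above roughly 1.96 corresponds to the conventional 5 percent significance level. The first case clears it comfortably and the second does not, which is the signal that the set needs to grow.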
LLM-as-judge
```mermaid
flowchart LR
    AO[Agent output] --> J[LLM judge]
    GT[Ground truth or criteria] --> J
    J --> SCORE[0.0 to 1.0 score]
    J --> REASON[Explanation]
```
A judge is a separate LLM call with a scoring prompt. It reads the agent's output, compares it against the ground truth or criteria, and emits a score between zero and one plus a short explanation. Anthropic reports that LLM-as-judge "proved most consistent and aligned with human judgements" in their experience (Anthropic, 2024).
The judge is not perfect; it inherits its own model's biases and miscalibration. Two mitigations matter. First, calibrate the judge on a held-out set that humans have graded, and measure judge-human agreement before trusting it. Second, use the judge as one layer among several. Production-grade evaluation follows the Swiss Cheese Model: several imperfect layers whose failure modes are uncorrelated.
Automated evaluations run on every commit, using the judge and any deterministic checks the team has in place. They are fast and unlimited; they measure the agent against the twenty-query set or its expansion.
Production monitoring captures ground truth from real usage: actual outcomes, user ratings, downstream conversions. Production data catches what automated evaluations do not, because real queries differ from curated ones.
Periodic human review calibrates the automated layer. A handful of traces graded by humans every week surfaces whether the judge is drifting. When the judge and humans disagree, the judge prompt needs revision.
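Judge-human agreement on a held-out graded set can be measured with raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. This is a minimal sketch: the function name `agreement_and_kappa` and the sample labels are hypothetical, and the acceptable kappa threshold is a choice the team has to make for its own task.

```python
from collections import Counter


def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    labels = set(h_counts) | set(j_counts)
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical held-out set: four traces graded by a human and by the judge.
human_labels = ["correct", "correct", "incorrect", "incorrect"]
judge_labels = ["correct", "incorrect", "incorrect", "incorrect"]
raw, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement={raw:.2f} kappa={kappa:.2f}")  # agreement=0.75 kappa=0.50
```

Tracking the same two numbers in the weekly human review gives a concrete trigger for revising the judge prompt when they fall.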
Pass-at-k and pass-to-the-k
Single-run accuracy is misleading for agents because agent runs are stochastic. Two metrics from the research literature handle this directly.
Pass-at-k is the probability that at least one of k runs produces a correct answer. It rewards systems that can solve a problem eventually, even if they are unreliable on any individual attempt. It is the right metric when the user can retry, or when the agent has an internal verification step that picks the correct answer from several attempts.
Pass-to-the-k (written pass^k) is the probability that all k runs produce a correct answer. It is a stricter metric that measures robustness. It is the right metric when the user sees exactly one output and that output must be correct.
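The two metrics measure different things. Pass-at-k can only rise as k grows and pass^k can only fall, so the gap between them shows how much of a system's success depends on retries. A system with 80 percent pass-at-1 and 40 percent pass^10 has different failure modes than a system with 60 percent pass-at-1 and 90 percent pass-at-5: the first is usually right but not dependably so across repeated runs, while the second often misses on the first attempt but recovers with retries. Reporting both, or a variant appropriate to the product, is standard in benchmark papers.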
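For a single task with n recorded runs, of which c were judged correct, both metrics can be estimated without rerunning the agent. The sketch below uses the combinatorial estimator for pass-at-k that is common in code-generation benchmarks and the analogous form for pass^k. The function names are illustrative, and both estimates assume the n runs are exchangeable samples from the same agent configuration.

```python
from math import comb


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled runs is correct), given c correct out of n runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k sampled runs come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def estimate_pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k sampled runs are correct), given c correct out of n runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that all k sampled runs come from the c successes.
    return comb(c, k) / comb(n, k)


# Hypothetical task: 8 of 10 recorded runs were judged correct.
print(estimate_pass_at_k(10, 8, 5))             # 1.0 (only 2 failures, so 5 draws must hit a success)
print(round(estimate_pass_hat_k(10, 8, 5), 3))  # 0.222
```

The same task therefore looks solved under pass-at-5 and unreliable under pass^5, which is why the choice of metric should follow how the product exposes the agent's output.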
Beyond accuracy, track transcript statistics: number of turns, number of tool calls, token usage, wall-clock latency, and task-specific metrics (did the code compile, did the deployment succeed, did the user accept the response).
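One way to keep these statistics next to the pass rate is a small per-run record and a summary function. The field names and the `summarize` helper below are assumptions for illustration, not part of any particular tracing library.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunStats:
    """Transcript statistics for one agent run (hypothetical schema)."""
    turns: int
    tool_calls: int
    tokens: int
    latency_s: float
    passed: bool


def summarize(runs: list[RunStats]) -> dict:
    """Aggregate per-run statistics into the figures reported for a release."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
        "mean_turns": mean(r.turns for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
        "mean_tokens": mean(r.tokens for r in runs),
        "max_latency_s": max(r.latency_s for r in runs),
    }


# Two made-up runs, purely to show the shape of the output.
summary = summarize([
    RunStats(turns=3, tool_calls=2, tokens=1000, latency_s=1.5, passed=True),
    RunStats(turns=5, tool_calls=4, tokens=3000, latency_s=4.5, passed=False),
])
print(summary)
```

Comparing these figures across versions catches regressions that accuracy alone hides, such as an agent that still passes but needs twice the tool calls.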
Benchmarks worth knowing
Several benchmarks have become reference points in the agentic evaluation literature. Each tests a different capability.
| Benchmark | What it tests | Reference result |
|---|---|---|
| SWE-bench | Resolving real GitHub issues with code patches | Top systems: 30 to 50 percent on the Lite split |
| GAIA | Multi-step reasoning plus tool use across 466 questions | Humans: 92 percent; GPT-4 plus plugins: 15 percent |
| AgentBench | Eight environments (OS, DB, web, games) | Commercial models substantially outperform open-source models |
| WebArena | Self-hosted websites (e-commerce, GitLab) | Top agents: approximately 35 percent |
| ToolBench | Over 16,000 real-world APIs across 49 categories | Tests API selection and tool chaining |
Benchmarks are useful for comparing approaches at a point in time; they are less useful for driving day-to-day improvement. A team building a customer service agent does not optimize for SWE-bench. Use benchmarks to anchor the team's sense of where the field is; use the internal twenty-query set (and its expansion) to drive iteration.
Two versions in code
The excerpt below is an LLM-as-judge with a pass-at-k harness, without a framework. The judge returns a typed verdict; the harness runs the agent k times and aggregates.
```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class Verdict(BaseModel):
    """Typed judge output: numeric score, categorical verdict, explanation."""
    score: float = Field(ge=0.0, le=1.0)
    verdict: Literal["correct", "partial", "incorrect"]
    reasoning: str


def judge(query: str, answer: str, criteria: str) -> Verdict:
    # One separate LLM call with a scoring prompt, parsed into a Verdict.
    # In experiments, pin the judge to a dated model snapshot rather than an alias.
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions=("Compare the answer against the criteria. Score 0.0-1.0. "
                      "Return 'correct' only if fully matches criteria."),
        input=f"Query: {query}\nAnswer: {answer}\nCriteria: {criteria}",
        text_format=Verdict)
    # output_parsed avoids indexing into output items by position.
    return r.output_parsed


def pass_at_k(agent, queries: list[dict], k: int = 5) -> dict:
    # Run the agent k times per query, judge every attempt, then aggregate.
    results = []
    for q in queries:
        attempts = [agent(q["query"]) for _ in range(k)]
        scored = [judge(q["query"], a, q["criteria"]) for a in attempts]
        any_correct = any(v.verdict == "correct" for v in scored)  # pass@k for this query
        all_correct = all(v.verdict == "correct" for v in scored)  # pass^k for this query
        mean_score = sum(v.score for v in scored) / len(scored)
        results.append({"query": q["query"], "pass@k": any_correct,
                        "pass^k": all_correct, "mean_score": mean_score})
    # Booleans sum as 0/1, so these are the fractions of queries that passed.
    return {
        "pass@k": sum(r["pass@k"] for r in results) / len(results),
        "pass^k": sum(r["pass^k"] for r in results) / len(results),
        "mean_score": sum(r["mean_score"] for r in results) / len(results),
        "details": results,
    }
```
The LangChain version uses LangSmith for tracing and evaluation. The evaluate helper runs a dataset against an agent and applies both custom and LLM-based evaluators.
```python
from langsmith import Client, evaluate
from langsmith.evaluation import LangChainStringEvaluator
from langchain.chat_models import init_chat_model

client = Client()
dataset_name = "support-queries-v1"

# LLM-graded QA evaluator, backed by the judge model.
qa_evaluator = LangChainStringEvaluator(
    "qa", config={"llm": init_chat_model("gpt-4o-mini")})


def target(inputs: dict) -> dict:
    # `agent` is assumed to be defined elsewhere; it is not shown in this excerpt.
    result = agent.invoke({"messages": [("user", inputs["query"])]})
    return {"answer": result["messages"][-1].content}


results = evaluate(target, data=dataset_name,
                   evaluators=[qa_evaluator],
                   experiment_prefix="agent-v2-",
                   num_repetitions=5)  # for pass-at-k-style variance
```
Full runnable versions will live at github.com/subodhjena/agentic-patterns under examples/25_evaluation.py as that lesson lands.
Where evaluation goes wrong
The anti-patterns below appear across teams and are specifically called out in Anthropic's evaluation guidance.
Brittle grading. Exact step-sequence validation penalizes valid alternatives. If the agent produces a correct answer through a different path than the reference solution, strict grading marks it wrong. Use outcome-based grading when possible; grade the destination, not the route.
Ambiguous specifications. An agent cannot succeed at a task if success is not well-defined. When the criteria are fuzzy, human-judge agreement is poor and automated judges are useless. Write concrete criteria first; only then measure against them.
Not reading transcripts. Summary metrics hide the reasons the agent fails. A team that only looks at aggregate pass rate misses patterns in the failures that transcript inspection reveals. Schedule time to read failed traces on every release.
One-sided evaluations. Testing only the happy path produces an agent that works on the cases the team thought of and fails on the cases they did not. Curate the evaluation set to include ambiguous inputs, adversarial inputs, and edge cases the agent must handle.
Over-rigid matching. Exact string matching penalizes "96.12" when the expected answer is "96.124991..." A judge or numeric check tuned to the task's tolerance avoids this; plain string-match grading does not.
Judge drift. The judge is itself a language model, and its behavior can drift across model versions. When the judge's behavior changes, the measured agent performance appears to change without the agent itself changing. Pin the judge model version in experiments; recalibrate when it changes.
Benchmark over-fitting. Optimizing heavily against one benchmark produces an agent that performs well on that benchmark and worse in the wild. Use several benchmarks and regular production monitoring.
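The "96.12" versus "96.124991..." case above does not always need an LLM judge; a deterministic grader with an explicit tolerance can handle it. The sketch below is one possible shape: `grade_numeric` and its default tolerance are assumptions, and the expected value is the truncated figure from the example, used only for illustration.

```python
import math


def grade_numeric(answer: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """Accept a numeric answer if it is within rel_tol of the expected value."""
    try:
        value = float(answer.strip())
    except ValueError:
        return False  # non-numeric answers fail rather than raise
    return math.isclose(value, expected, rel_tol=rel_tol)


expected = 96.124991  # illustrative; the source value is truncated
print(grade_numeric("96.12", expected))     # True: within 0.1 percent
print(grade_numeric("95.0", expected))      # False: outside tolerance
print(grade_numeric("about 96", expected))  # False: not parseable as a number
```

The tolerance belongs in the task definition alongside the expected answer, so that graders and humans apply the same standard.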
Trade-offs against skipping evaluation
The cost of not evaluating is larger than it looks on day one.
| Axis | No evaluation | Full evaluation stack |
|---|---|---|
| Iteration speed | Fast, blind | Slower per change, informed |
| Regression detection | Reactive, in production | Proactive, before deploy |
| Cost of evaluation | Zero | Judge tokens, human review hours |
| Confidence in changes | Low | High |
| Ability to ship | High in the short term | Sustained over time |
Skipping evaluation accelerates early development and slows everything after. An agent shipped without an evaluation harness is stuck at its current performance, because every change is a guess.
Neighbors in the series
Guardrails, the previous article, is the safety-focused companion of evaluation; guardrails catch unsafe behavior at runtime, evaluation catches it at development time. Evaluator-optimizer, in the Workflows stage, uses an evaluator inside the agent's inner loop; agent evaluation uses a similar judge in the outer loop. Harness design, the next article, describes the planner-generator-evaluator architecture that uses evaluation as a first-class component. The decision framework article at the end of the series uses evaluation data to justify pattern choices.
References
- Anthropic. Building effective agents. December 2024.
- OpenAI. Evaluation and iteration in agent workflows. 2024.
- Jimenez, Carlos, et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
- Mialon, Gregoire, et al. GAIA: A Benchmark for General AI Assistants. 2023.
- Zheng, Lianmin, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
