Agent Evaluation: LLM-as-Judge, Pass-at-K, and Benchmarks
·8 min·AI
An agent that cannot be measured cannot be improved. Evaluation starts with twenty queries and an LLM-as-judge, scales up through pass-at-k metrics and standard benchmarks, and never trusts any single layer on its own.
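To make the pass-at-k idea concrete, here is a minimal sketch of the commonly used unbiased estimator: given `n` attempts at a task, of which `c` passed, it estimates the probability that at least one of `k` randomly drawn attempts passes, computed as 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the parameter names are illustrative assumptions, not something this post defines.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts passes.

    n: total attempts recorded for the task (hypothetical name)
    c: number of those attempts that passed
    k: number of attempts drawn without replacement
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the ways to draw k attempts that all fail.
    # It evaluates to 0 when fewer than k failing attempts exist,
    # in which case the estimate is exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 3 passes out of 10 attempts, scored at k = 1 and k = 5.
    print(round(pass_at_k(10, 3, 1), 4))
    print(round(pass_at_k(10, 3, 5), 4))
```

Averaging this value over every task in a query set gives one suite-level number. With `k = 1` it reduces to the plain pass rate `c / n`, which is why larger `k` values are usually reported alongside it rather than instead of it.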