Agent Evaluation: LLM-as-Judge, Pass-at-K, and Benchmarks

May 6, 2026 · 8 min · AI

An agent that cannot be measured cannot be improved. Evaluation starts with twenty queries and an LLM-as-judge, scales up through pass-at-k metrics and standard benchmarks, and never trusts any single layer alone.