Subodh Jena
Blog

Writing

Life, tech, and everything in-between.

Agent Evaluation: LLM-as-Judge, Pass-at-K, and Benchmarks

May 6, 2026 · 8 min · AI

An agent that cannot be measured cannot be improved. Evaluation starts with twenty queries and an LLM-as-judge, scales up through pass-at-k metrics and standard benchmarks, and never trusts any single layer alone.

© 2026 Subodh Jena
