
Your Agent Evals Are Vibes. Here's How to Make Them Infrastructure.

9 February 2026 · 12 min read

TL;DR

  • Most teams evaluate agents through manual spot checks and hope; this is the primary reason agents stall at the prototype stage
  • You don't need thousands of test cases to start; 20-50 examples drawn from real production failures create a high-signal regression suite
  • For agents, grading the execution path matters as much as grading the final output: did it loop unnecessarily, call the expensive tool when the cheap one would do, or take 30 steps when 5 would suffice?

I've written about the 95% Trap, how compounding accuracy across multi-step agent workflows produces system-level reliability far below what each step achieves. I've argued that evals are day-one infrastructure, not a nice-to-have for later. But I haven't addressed the practical question: how do you actually build agent evals when you're starting from nothing?

Most product teams default to what I call the vibe method. Manual chats. Spot checks. "Yeah, that looks right." A quick demo before the release. Hope for the best.

This is the single biggest reason so few agents graduate from prototype to production. Without evals, you're flying blind. You can't measure whether a change made things better or worse. You can't distinguish real regressions from noise. You can't test against hundreds of scenarios before shipping. You're stuck in a reactive loop, waiting for production failures, reproducing them manually, patching, and hoping nothing else broke.

That cycle is expensive and unsustainable. This is how to replace it.

Why agents are harder to evaluate than chatbots

For a single-turn chatbot, evaluation is relatively straightforward. You give it an input, it produces an output, you check whether the output is good. The grading surface is small: one response, one judgment.

Agents are fundamentally different. They operate over many turns, calling tools, modifying state, making decisions based on intermediate results, adapting their approach as they go. The same capabilities that make agents useful (autonomy, flexibility, multi-step reasoning) make them harder to evaluate.

Mistakes compound and propagate. An agent that retrieves the wrong data in step two will produce a confidently wrong analysis in step five. An agent that chooses an expensive reasoning model for a task that a cheap model could handle costs you money on every invocation. An agent that loops through twelve tool calls when three would suffice is burning tokens and latency for no benefit.

This means you need to evaluate more than just the final output. You need to evaluate the path.

Grade the path, not just the destination

For a chatbot, the final answer is what matters. For an agent, the execution trace is critical.

Consider a support agent that successfully resolves a customer's refund request. The outcome is correct: the refund was processed. But the trace reveals problems: the agent called the identity verification tool three times unnecessarily, looped back to re-read the policy document it had already retrieved, used the expensive reasoning model to format a simple confirmation email, and took 15 turns to complete a task that should take 5.

The outcome passed. The execution was terrible. And in production, that terrible execution translates directly into inference cost, latency, and user frustration.

Effective agent evals combine outcome grading with trace grading:

Outcome grading checks whether the end state is correct. Did the refund get processed? Did the code compile? Did the report contain the required information? This is binary, verifiable, and essential.

Trace grading checks how the agent got there. How many turns did it take? Which tools did it call, and were they the right ones? Did it use expensive models for tasks that cheap models could handle? Did it loop or backtrack unnecessarily? Was the interaction with the user clear and efficient?

Both dimensions matter. A correct outcome achieved wastefully is a margin problem. An efficient trace that produces the wrong outcome is a quality problem. You need to measure both.
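The two dimensions can be sketched as separate graders over the same run. This is a minimal illustration, assuming a hypothetical trace format (turn counts, tool call lists, model names) and illustrative thresholds, not a standard schema:

```python
def grade_outcome(trace: dict) -> bool:
    """Outcome grading: is the end state correct?"""
    return trace["final_state"].get("refund_processed") is True


def grade_trace(trace: dict, max_turns: int = 5) -> list[str]:
    """Trace grading: flag wasteful execution even when the outcome passed."""
    problems = []
    if trace["turns"] > max_turns:
        problems.append(f"took {trace['turns']} turns, budget was {max_turns}")
    for tool in set(trace["tool_calls"]):
        count = trace["tool_calls"].count(tool)
        if count > 1:
            problems.append(f"called {tool} {count} times")
    if "expensive-reasoning-model" in trace["model_calls"] and trace["task_complexity"] == "simple":
        problems.append("used expensive model for a simple task")
    return problems


# The refund example from the text: correct outcome, terrible execution.
run = {
    "final_state": {"refund_processed": True},
    "turns": 15,
    "tool_calls": ["verify_identity", "verify_identity", "verify_identity",
                   "read_policy", "read_policy", "process_refund"],
    "model_calls": ["expensive-reasoning-model"],
    "task_complexity": "simple",
}

assert grade_outcome(run)   # outcome passed...
print(grade_trace(run))     # ...but the trace reveals the waste
```

A run is only fully green when both graders are satisfied; logging the trace problems separately gives you the margin signal even while outcomes stay correct.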

Start with 20 examples, not 2,000

The most common reason teams delay building evals is the belief that they need a massive dataset. They think evals require hundreds or thousands of test cases, months of preparation, and a dedicated team.

They don't.

Twenty to fifty well-chosen examples are enough to create a high-signal regression suite that catches real problems. The key word is "well-chosen." Your first eval examples should come from three sources:

Your bug tracker. Every agent failure that a user has reported is a test case waiting to be written. The input that caused the failure is your test input. The correct behaviour is your expected outcome. Converting production failures into automated test cases ensures your suite reflects actual usage patterns, not hypothetical scenarios.

Your support queue. Customer complaints about agent behaviour (wrong answers, slow responses, awkward interactions) are signals about what your users actually care about. These map directly to eval criteria.

Your manual checks. Every team has pre-release rituals. The things you manually verify before every deployment ("does it still handle cancellations correctly?" "does it respond appropriately to abusive inputs?" "does it use the right tool for address lookups?") are your first evals. Script them. Automate the checks you're already doing by hand.
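Scripting a manual check can be as small as one assertion. A sketch, assuming a hypothetical agent interface (a callable taking a prompt and returning text) and an illustrative expected phrasing:

```python
def check_cancellation(agent) -> bool:
    """The pre-release ritual 'does it still handle cancellations
    correctly?' turned into a scripted check."""
    reply = agent("Please cancel my subscription")
    return "cancel" in reply.lower() and "subscription" in reply.lower()


def stub_agent(prompt: str) -> str:
    # Stand-in for the real agent under test.
    return "I've cancelled your subscription, effective immediately."


assert check_cancellation(stub_agent)
```

The check is crude, but it is repeatable, which is the property your manual spot checks lack.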

In the early stages of agent development, each change to the system tends to have a large, noticeable impact. Small sample sizes are sufficient to detect these large effects. As your agent matures and changes become more subtle, you'll naturally expand the eval suite. But waiting for a large dataset before you start means you're shipping without measurement during the most volatile phase of development, exactly when measurement matters most.

[Figure: pyramid building from 20 seed examples at the base, through capability benchmarks, to the regression suite at the top]

The two types of eval suites you need

Not all evals serve the same purpose. You need two distinct types, and conflating them leads to confusion.

Capability evals ask: "What can this agent do?" These should target tasks the agent currently struggles with. They start with low pass rates and give your team a hill to climb. When your refund agent handles straightforward cases at 95% but complex edge cases at 30%, the capability eval focuses on those edge cases. It measures progress toward a goal.

Regression evals ask: "Does the agent still handle everything it used to?" These should have near-100% pass rates. They protect against backsliding, the maddening phenomenon where fixing one behaviour breaks three others. A declining score on a regression eval is an immediate signal that something is wrong.

The lifecycle is natural: as you hill-climb on capability evals and pass rates rise above your quality threshold, those tasks graduate into the regression suite. What once measured "can we do this at all?" now measures "can we still do this reliably?"

This distinction matters operationally. Capability evals tell you where to invest. Regression evals tell you when to stop and investigate. Running both on every change gives you a complete picture of whether you're making genuine progress or just shifting problems around.
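The graduation lifecycle above can be expressed as a few lines of bookkeeping. The 0.9 threshold and suite labels here are illustrative choices, not fixed values:

```python
QUALITY_THRESHOLD = 0.9  # illustrative quality bar


def graduate(tasks: list[dict]) -> list[dict]:
    """Move tasks whose pass rate has climbed above the bar from the
    capability suite into the regression suite."""
    for task in tasks:
        if task["suite"] == "capability" and task["pass_rate"] >= QUALITY_THRESHOLD:
            task["suite"] = "regression"
    return tasks


tasks = [
    {"name": "simple-refund", "suite": "capability", "pass_rate": 0.95},
    {"name": "multi-currency-refund", "suite": "capability", "pass_rate": 0.30},
]
graduate(tasks)
# simple-refund now protects against backsliding;
# the edge case remains a hill to climb.
```

Running this reclassification after each eval run keeps the two suites honest: regression stays near 100%, capability stays hard.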

Building graders that actually work

The grading logic is where most teams get stuck. How do you automatically judge whether an agent's output is "good"?

In practice, you combine three approaches, each with different strengths.

Deterministic graders are fast, cheap, and objective. Does the output match a known pattern? Does the code compile? Does the extracted data conform to the expected schema? Does the email address have a valid format? These checks are nearly free to run and catch a large class of obvious failures. They're your first line of defence.
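Deterministic graders are typically a few lines each. A sketch of two of the checks mentioned above, with an illustrative (deliberately loose) email pattern and a hypothetical required-field set:

```python
import json
import re

# Loose, illustrative email pattern; real validation rules vary.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def grade_email_format(value: str) -> bool:
    """Does the email address have a valid format?"""
    return bool(EMAIL_RE.match(value))


def grade_schema(output: str, required_keys: set[str]) -> bool:
    """Does extracted data parse as JSON and contain the expected fields?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return required_keys <= set(data)


assert grade_email_format("user@example.com")
assert not grade_email_format("not-an-email")
assert grade_schema('{"name": "Ada", "amount": 42}', {"name", "amount"})
assert not grade_schema('{"name": "Ada"}', {"name", "amount"})
```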

Model-based graders use a language model to judge the agent's output against a rubric. "Did the response show empathy?" "Was the explanation clear?" "Did the agent ground its answer in the retrieved documents?" These are more expensive and non-deterministic, but they capture nuance that deterministic checks can't. The critical requirement is calibration: your model-based grader needs to agree with human judgment on a representative sample. Without calibration, you're replacing one form of vibes with another.

Human graders are the gold standard. Subject matter experts reviewing agent outputs catch things that neither code nor models reliably detect. But they're expensive and slow. Use human grading for calibrating your automated graders, for spot-checking production outputs, and for evaluating genuinely novel scenarios where you don't yet know what "good" looks like.

The right combination depends on your agent type. Code-producing agents lean heavily on deterministic graders: does the code pass the test suite? Conversational agents lean more on model-based rubrics: was the interaction helpful and appropriate? Most production agents use a weighted combination of all three.
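One way to express the weighted combination is a simple blend of grader scores. This is a sketch under stated assumptions: the weights are arbitrary, and `model_score` is a placeholder stub where a calibrated LLM judge would sit in a real system:

```python
def deterministic_score(output: str) -> float:
    # Stand-in for the cheap checks: here, just "non-empty output".
    return 1.0 if output.strip() else 0.0


def model_score(output: str) -> float:
    # Placeholder for a calibrated LLM judge returning a 0-1 rubric score.
    return 0.8


def combined_score(output: str, weights: tuple[float, float] = (0.4, 0.6)) -> float:
    """Blend deterministic and model-based grades into one number."""
    w_det, w_model = weights
    return w_det * deterministic_score(output) + w_model * model_score(output)


print(combined_score("Your refund has been processed."))  # 0.4*1.0 + 0.6*0.8 ≈ 0.88
```

For a code-producing agent you would shift the weights toward the deterministic side; for a conversational agent, toward the rubric side.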

Embracing non-determinism

This is the part that trips up product teams with a traditional software background: agents are probabilistic. The same input won't always produce the same output. A task that passed on Monday might fail on Tuesday, not because anything changed, but because the model is stochastic.

This means binary pass/fail on a single run isn't enough. You need to understand reliability distributions.

Two metrics capture this well:

pass@k measures the probability that the agent gets at least one correct solution in k attempts. If your agent has a 70% per-trial success rate, pass@3 is about 97%. This metric matters when one success is sufficient, when you're generating candidate solutions and a human or downstream system picks the best one.

pass^k measures the probability that all k trials succeed. Same 70% agent, pass^3 drops to about 34%. This metric matters for customer-facing agents where users expect reliable behaviour every single time. It's the metric that exposes the 95% Trap. A 95% per-trial rate sounds excellent until you need consistency across multiple interactions. The spot-check architecture I've written about uses confidence scoring to route only uncertain results to expensive manager models, which makes pass^k viable at scale.
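Both metrics fall out of the per-trial success rate under an independence assumption (trials are rarely perfectly independent in practice, but the approximation is useful):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1 - (1 - p) ** k


def pass_power_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


print(round(pass_at_k(0.70, 3), 3))      # 0.973 -> "about 97%"
print(round(pass_power_k(0.70, 3), 3))   # 0.343 -> "about 34%"
print(round(pass_power_k(0.95, 10), 2))  # the 95% Trap across 10 interactions
```

The third line is the 95% Trap in miniature: an excellent-sounding per-trial rate decays fast when users need every interaction to succeed.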

Which metric you care about depends on your product. For internal tools where a human reviews the output, pass@k is often sufficient. For autonomous customer-facing agents, pass^k is what determines whether your users trust the system. Both should be in your eval dashboard.

Evals as competitive advantage

There's a strategic dimension to evals that goes beyond quality assurance.

When a new model drops (and they're dropping every few weeks), teams without evals face weeks of manual testing to determine whether the new model is safe to adopt. Teams with evals run the suite in hours, compare results against their quality thresholds, and make a data-driven decision. They can adopt improvements faster, route tasks to better models sooner, and adjust their multi-model orchestration with confidence.

I've argued in the context of multi-model orchestration that the teams who move fastest are the ones with robust eval infrastructure. The evals don't slow you down. They're the mechanism that lets you move fast. Every model upgrade, every prompt change, every architectural modification can be tested against your quality bar automatically, at scale, before it reaches a single user.

Evals also become the highest-bandwidth communication channel between product and engineering. Instead of debating in meetings whether the agent is "good enough," you have metrics. Instead of arguing about whether a change improved things, you have before-and-after scores. The eval suite becomes the shared definition of quality that everyone can point to.

The starting point

If you have no evals today, here's the path:

  1. List the five things you manually check before every release. Write them down. Those are your first five test cases.
  2. Pull the last ten agent failures from your bug tracker or support queue. Convert each into a test case with a clear input and expected outcome. Those are your regression tests.
  3. For each test case, write the simplest grader that could work. String match, schema validation, state check. Don't over-engineer the grading.
  4. Run the suite on every change. Manually at first. Automated in CI as soon as possible.
  5. Expand from failures, not from imagination. Every new production failure becomes a new test case. Your suite grows organically from real-world usage.
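The five steps above fit in a runner small enough to write in an afternoon. A minimal sketch, with a stub agent and hypothetical case IDs standing in for your real system:

```python
def run_evals(agent, cases) -> float:
    """Run every case through the agent and return the pass rate,
    naming the failures so regressions are immediately visible."""
    failures = []
    for case in cases:
        output = agent(case["input"])
        if not case["check"](output):
            failures.append(case["id"])
    if failures:
        print("FAILED:", ", ".join(failures))
    return 1 - len(failures) / len(cases)


# Illustrative cases: each pairs an input with the simplest check that works.
cases = [
    {"id": "refund-happy-path", "input": "refund order 123",
     "check": lambda out: "refund" in out.lower()},
    {"id": "order-status", "input": "where is my order?",
     "check": lambda out: "order" in out.lower()},
]


def stub_agent(prompt: str) -> str:
    # Stand-in for the real agent under test.
    return f"Working on it: {prompt}"


print(run_evals(stub_agent, cases))  # 1.0 when both checks pass
```

Wiring this into CI is step four: fail the build when the pass rate drops below your regression threshold.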

You'll have a functional eval suite in a day. It won't be comprehensive, but it will catch the failures that matter most: the ones that have already hurt you. Everything after that is hill-climbing.

Evals aren't a quality phase that happens after development. They're the infrastructure that makes development possible. Without them, every change is a gamble. With them, every change is a measured experiment. I cover the full evaluation framework in the handbook, from grader design to CI integration.

Stop vibing. Start measuring.


Frequently Asked Questions

How do you eval agents that produce creative or open-ended output?

Model-based graders with calibrated rubrics. Define what "good" looks like along specific dimensions (accuracy, completeness, tone, groundedness in source material) and have an LLM grade against those dimensions. Calibrate the grader by having humans score a sample and adjusting until the model's judgments align with expert consensus. You can't reduce creative output to binary pass/fail, but you can score it along defined quality axes consistently enough to detect regressions.

How many trials should you run per task to account for non-determinism?

Three to five is a practical starting point for most tasks. This gives you enough signal to estimate per-trial success rates without making eval runs prohibitively expensive. For critical tasks where reliability matters most, run more. For exploratory capability evals where you're measuring whether the agent can do something at all, even two trials are informative. The goal is statistical signal, not statistical perfection.

When should evals run: on every commit, nightly, or before release?

On every change that could affect agent behaviour. In practice, this means running the core regression suite on every PR that touches prompts, model configuration, or tool definitions. Run the full suite (including expensive capability evals) nightly or before release. The cost of running evals is measured in inference dollars. The cost of not running them is measured in production failures and customer trust.

Logan Lincoln

Product executive and AI builder based in Brisbane, Australia. Nine years in regulated B2B SaaS, currently shipping production AI platforms.