Evaluation Frameworks as Product Infrastructure
Why evals are day-one infrastructure, how to build them from 20 examples, and why your eval suite is your competitive advantage.
TL;DR
- Evals are day-one infrastructure, not post-launch monitoring. If you can't measure quality, you can't iterate.
- Start with 20–50 examples drawn from real production failures, not 20,000 synthetic cases. Grow them from seed examples into capability benchmarks and then a regression suite.
- For agents, grade the path and the destination. Trace grading catches loops, wasted tool calls, and cost blow-outs even when the final output is correct.
Most AI teams treat evaluation as a quality phase that happens after the product works. Write the prompts, build the tooling, test it manually, ship it, then maybe add some evals when things start breaking.
This gets it exactly backwards. Evals are the infrastructure that makes development possible. Without them, every prompt change is a gamble. Every model swap is a leap of faith. Every new feature is a potential regression you won't discover until a customer reports it.
The eval suite is not a test harness. It's the product's acceptance criteria, the shared definition of quality that product, engineering, and leadership can point to. A feature isn't done until it passes its evals.
Why evals come first
Traditional software has a well-understood testing story. Unit tests, integration tests, end-to-end tests, CI/CD pipelines. AI products inherit none of this for free.
A model can be technically healthy (low latency, no errors, serving responses) while producing increasingly wrong answers. A prompt change that improves one capability can silently degrade four others. A model upgrade that scores better on public benchmarks can perform worse on your specific use cases.
You cannot observe these problems through logs and dashboards. You need structured evaluation that runs automatically, compares against known-good baselines, and surfaces regressions before they reach users.
Three reasons evals must be day-one work:
The most volatile phase is the least measured. Early development produces the largest behavioural swings. Each prompt iteration, each tool change, each model swap shifts output dramatically. This is exactly when measurement matters most, and exactly when most teams rely on manual spot checks.
Measurement shapes the work. Teams that build evals early naturally write more precise requirements. Instead of "the agent should handle refunds well," they write "the agent should process a standard refund in under 5 tool calls with 95%+ accuracy." The eval forces specificity.
Retroactive eval suites are worse. Building evals after launch means reconstructing test cases from memory and imagination rather than capturing them as they happen. The resulting suite reflects what you remember failing, not what actually failed.
Building the eval suite
The growth path for an eval suite has three stages, each building on the last.
Stage 1: Seed examples (20–50 cases)
Your first eval examples come from three places:
Production failures. Every bug report, every support ticket about wrong agent behaviour, every "that's not right" from a manual check is a test case. The input that caused the failure is your test input. The correct behaviour is your expected outcome.
Pre-release rituals. Every team has things they manually verify before shipping. "Does it still handle cancellations?" "Does it pick the right tool for address lookups?" Script these checks. Automate the things you're already doing by hand.
Edge cases from domain experts. Your subject matter experts know the tricky inputs, the ambiguous requests, the scenarios where the system should refuse rather than guess. Capture ten of these. They'll catch problems that happy-path tests never will.
Twenty well-chosen examples beat two thousand synthetic ones. Synthetic data reflects your assumptions about what might go wrong. Production data reflects what actually goes wrong.
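A seed example can be as simple as a small record pairing a real input with its expert-validated expected behaviour. Here is a minimal sketch; the field names and the `EvalCase` structure are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One seed example: a real production input plus the expected behaviour."""
    case_id: str
    source: str          # "production_failure", "pre_release_ritual", "domain_expert"
    input_text: str
    expected: str        # expert-validated correct behaviour
    tags: list = field(default_factory=list)

# A production failure converted directly into a test case.
case = EvalCase(
    case_id="refund-017",
    source="production_failure",
    input_text="I was charged twice for my March subscription.",
    expected="Issue one refund for the duplicate charge; do not cancel the subscription.",
    tags=["refunds", "billing"],
)
```

Tagging each case with its origin pays off later: when a capability eval plateaus, you can see at a glance whether the failing cases came from production or from expert-imagined edge cases.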
Stage 2: Capability benchmarks
As the seed suite stabilises, add capability evals that target what the agent struggles with. These start with low pass rates and give your team a hill to climb.
If your agent handles straightforward queries at 90% but complex multi-step tasks at 30%, the capability eval focuses on those multi-step tasks. You're measuring progress toward a specific goal.
Capability evals use human-graded gold standards. A domain expert reviews the expected outputs and confirms they represent genuinely correct behaviour. Without this calibration step, you risk optimising toward outputs that pass automated checks but fail in practice.
Stage 3: Regression suite
As capability evals improve and pass rates cross your quality threshold, those tasks graduate into the regression suite. What once measured "can we do this at all?" now measures "can we still do this reliably?"
Regression evals are binary pass/fail on known-good examples. They run on every change that could affect agent behaviour: prompt edits, model swaps, tool modifications, configuration changes. A declining score on a regression eval is an immediate stop-and-investigate signal.
The lifecycle is natural: seed examples catch the failures that already hurt you, capability evals push toward better performance, and regression evals lock in those gains.
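A regression run can be a very small loop: execute every known-good case, grade it pass/fail, and report anything that slipped. This sketch uses toy stand-ins for the real agent and grader; the function names are hypothetical:

```python
def run_regression_suite(cases, run_agent, grade):
    """Run every known-good case; any failure is a stop-and-investigate signal."""
    failures = []
    for case in cases:
        output = run_agent(case["input"])
        if not grade(output, case["expected"]):
            failures.append(case["id"])
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures

# Toy stand-ins for the real agent and a deterministic grader.
cases = [
    {"id": "c1", "input": "2+2", "expected": "4"},
    {"id": "c2", "input": "3+3", "expected": "6"},
]
agent = lambda q: str(eval(q))        # placeholder "agent"
exact = lambda out, exp: out == exp   # binary pass/fail grader
rate, failed = run_regression_suite(cases, agent, exact)
# rate == 1.0, failed == []
```

Wire this into CI so that prompt edits, model swaps, and tool modifications all trigger it automatically.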
Grading path and destination
For a chatbot, the final answer is what matters. For an agent, the execution trace is just as important.
Consider a support agent that successfully processes a refund. The outcome is correct. But the trace reveals that the agent called the identity verification tool three times unnecessarily, re-read a policy document it had already retrieved, used an expensive reasoning model to format a confirmation email, and took 15 turns to complete a 5-turn task.
The outcome passed. The execution was terrible. In production, that waste translates directly into inference cost, latency, and user frustration.
Outcome grading
Checks whether the end state is correct. Did the refund process? Did the code compile? Did the report contain the required sections? Binary, verifiable, essential.
Outcome graders come in three flavours:
- Deterministic graders check pattern matches, schema conformance, compilation, format validation. Fast, cheap, and objective.
- Model-based graders use a language model to judge output against a rubric ("Was the explanation grounded in retrieved documents?"). More expensive and non-deterministic, but they capture nuance.
- Human graders are the gold standard for calibration and genuinely novel scenarios. Expensive and slow, so use them strategically.
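A deterministic grader for the refund example might check schema conformance and basic sanity. This is a sketch with an assumed output format (a JSON confirmation with `refund_id`, `amount`, and `currency` fields):

```python
import json
import re

def grade_refund_confirmation(output: str) -> bool:
    """Deterministic outcome grader: valid JSON, required fields, sane values."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not {"refund_id", "amount", "currency"} <= data.keys():
        return False
    if not re.fullmatch(r"[A-Z]{3}", data["currency"]):
        return False
    return isinstance(data["amount"], (int, float)) and data["amount"] > 0

assert grade_refund_confirmation('{"refund_id": "r-1", "amount": 19.99, "currency": "USD"}')
assert not grade_refund_confirmation('{"refund_id": "r-1"}')  # missing fields
```

Deterministic checks like this run in milliseconds, so they can gate every commit; reserve model-based and human grading for the judgments they can't make.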
Trace grading
Checks how the agent got there. Trace grading evaluates the full execution path against efficiency and correctness criteria:
- Turn count. Did the agent complete the task in a reasonable number of steps?
- Tool selection. Did it use the right tools? Did it call expensive tools when cheap ones would suffice?
- Loop detection. Did it revisit the same tool or re-read the same document unnecessarily?
- Backtracking. Did it abandon a correct approach and try an inferior one?
- Escalation timing. When confidence was low, did it escalate promptly or muddle through?
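Several of these checks can be automated over the raw execution trace. A minimal trace grader, assuming the trace is a list of tool-call records (the record shape and tool names are illustrative):

```python
from collections import Counter

def grade_trace(trace, max_turns=8, expensive_tools=frozenset({"reasoning_llm"})):
    """Grade the execution path, not the final answer.

    `trace` is a list of tool-call records: {"tool": ..., "args": ...}.
    Returns a list of issues; an empty list means the trace passed.
    """
    issues = []
    if len(trace) > max_turns:
        issues.append(f"turn_count: {len(trace)} > {max_turns}")
    # Loop detection: the same tool called with the same arguments.
    calls = Counter((step["tool"], step["args"]) for step in trace)
    for (tool, args), n in calls.items():
        if n > 1:
            issues.append(f"loop: {tool}({args}) called {n} times")
    # Cost routing: expensive model used where a cheap one would do.
    for step in trace:
        if step["tool"] in expensive_tools and step.get("simple_task"):
            issues.append(f"cost_routing: {step['tool']} used for a simple task")
    return issues

trace = [
    {"tool": "verify_identity", "args": "user-42"},
    {"tool": "verify_identity", "args": "user-42"},  # unnecessary repeat
    {"tool": "issue_refund", "args": "order-7"},
]
# grade_trace(trace) flags the repeated identity check as a loop
```

Backtracking and escalation timing are harder to check deterministically; those are good candidates for a model-based grader with a rubric.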
Both dimensions matter. A correct outcome achieved wastefully is a margin problem. An efficient trace that produces the wrong outcome is a quality problem. Measure both.
Eval dimensions beyond accuracy
Accuracy is necessary but insufficient. A production eval suite measures across multiple dimensions.
| Dimension | What it measures | Why it matters |
|---|---|---|
| Outcome accuracy | Is the final output correct? | The baseline quality gate. |
| Tool call efficiency | Number of tool calls vs. optimal path | Excess calls burn tokens and add latency. |
| Cost per task | Total inference spend for the task | Directly affects unit economics. |
| Latency | Wall-clock time from input to output | User experience and SLA compliance. |
| Loop detection | Repeated tool calls or circular reasoning | Loops are the most common agent failure mode. |
| Context utilisation | How well the agent uses available context | Poor utilisation leads to redundant retrieval. |
| Escalation rate | Frequency of human handoffs | Too high means the agent isn't useful. Too low means it's over-confident. |
| Cost routing | Expensive model usage vs. cheap model usage | Right-sizing model selection per sub-task. |
Track these dimensions per task type, not just in aggregate. An agent that's fast and cheap on simple queries but slow and expensive on complex ones needs different optimisation than one that's uniformly mediocre.
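Per-task-type tracking is a simple group-by over your eval results. A sketch, with an assumed flat result record per completed case:

```python
from collections import defaultdict
from statistics import mean

def summarise_by_task_type(results):
    """Aggregate eval metrics per task type rather than one global average."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["task_type"]].append(r)
    return {
        task_type: {
            "accuracy": mean(r["correct"] for r in rows),
            "avg_cost_usd": mean(r["cost_usd"] for r in rows),
            "avg_latency_s": mean(r["latency_s"] for r in rows),
        }
        for task_type, rows in buckets.items()
    }

results = [
    {"task_type": "simple", "correct": 1, "cost_usd": 0.01, "latency_s": 1.2},
    {"task_type": "simple", "correct": 1, "cost_usd": 0.01, "latency_s": 1.0},
    {"task_type": "multi_step", "correct": 0, "cost_usd": 0.20, "latency_s": 9.5},
]
summary = summarise_by_task_type(results)
# summary["simple"]["accuracy"] == 1.0; summary["multi_step"]["accuracy"] == 0.0
```

A global average over these three results would report 67% accuracy and hide the fact that multi-step tasks are failing entirely.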
Continuous model monitoring
Evals catch problems before deployment. Model monitoring catches problems that emerge after deployment, when the real world shifts underneath your system.
Three types of drift demand automated monitoring:
Data drift. The statistical properties of incoming inputs are changing. Users ask about topics, in formats, or at volumes the system wasn't built for. A property valuation model trained on suburban housing data starts receiving commercial property queries. The inputs look different. The model doesn't know it's out of distribution.
Prediction drift. The distribution of model outputs shifts even though inputs look similar. A classification model that used to distribute predictions across five categories starts concentrating on two. Something changed in the model's behaviour, and the cause isn't obvious from inputs alone.
Concept drift. The real-world meaning of the data has changed. What counted as a "high priority" support ticket six months ago has different criteria today. The model's accuracy degrades because reality moved, not because the model changed.
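Data and prediction drift can both be quantified by comparing a live distribution against a frozen baseline. One common metric is the Population Stability Index (PSI); this is a dependency-free sketch for a numeric feature, with the bucketing strategy and thresholds as illustrative choices:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live sample.

    Common rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 act now.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = max(0, min(int((v - lo) / width), bins - 1))
            counts[idx] += 1
        # Smooth empty buckets so the log is defined.
        return [max(c, 1) / len(values) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]   # frozen at training time
shifted = [v + 0.5 for v in baseline]      # live inputs have moved
# psi(baseline, baseline) ≈ 0.0; psi(baseline, shifted) is well above 0.25
```

The same calculation works for prediction drift by feeding it model output scores or class frequencies instead of input features. Concept drift, by contrast, usually needs labelled data to detect, which is one more reason the HITL loop below matters.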
Alerting and response
Drift monitoring must be automated. When any drift metric crosses a predefined threshold, the system should:
- Alert the product team
- Create a retraining or investigation item in the backlog
- Log the drift event with enough context to diagnose the cause
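The three response steps above can be wired into one handler. A sketch, where `alert` and `create_backlog_item` stand in for your paging and ticketing integrations, and the threshold values are illustrative:

```python
import logging
import time

THRESHOLDS = {"psi": 0.25}  # illustrative threshold per drift metric

def handle_drift(metric, value, context, alert, create_backlog_item,
                 log=logging.getLogger("drift")):
    """On threshold breach: alert the team, open a backlog item, log the event."""
    threshold = THRESHOLDS[metric]
    if value <= threshold:
        return False
    event = {"metric": metric, "value": value, "threshold": threshold,
             "ts": time.time(), **context}
    alert(f"Drift detected: {metric}={value:.3f} (threshold {threshold})")
    create_backlog_item(title=f"Investigate {metric} drift", details=event)
    log.warning("drift_event %s", event)
    return True
```

The `context` dict should carry whatever a responder needs to diagnose the cause: model version, feature name, time window, and a sample of the offending inputs.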
Model retraining is not tech debt. It's a core product activity. Traditional software features are stable until you deliberately change them. AI models, left alone, degrade over time as the world around them shifts.
The HITL feedback loop
Human-in-the-loop (HITL) workflows serve two purposes: quality control and data generation. Most teams understand the first. The second is where the real value sits.
The process
- Route. When the model produces a low-confidence output (or for a random sample), route the task to a human expert instead of serving it directly.
- Refine. The human corrects, validates, or improves the output.
- Feed back. The corrected output becomes a new labelled example, ground truth that feeds directly into your eval suite and, optionally, into fine-tuning datasets.
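The route-refine-feed-back loop can be sketched in a few lines. The confidence threshold, sample rate, and record shapes here are assumptions, not a prescribed design:

```python
import random

CONFIDENCE_THRESHOLD = 0.8   # illustrative
SAMPLE_RATE = 0.05           # random audit fraction, illustrative

def serve_or_route(task, run_model, review_queue):
    """Serve high-confidence outputs; route the rest to a human reviewer."""
    output, confidence = run_model(task)
    if confidence < CONFIDENCE_THRESHOLD or random.random() < SAMPLE_RATE:
        review_queue.append({"task": task, "draft": output})
        return None  # a human will respond instead
    return output

def feed_back(task, corrected_output, eval_suite):
    """Every human correction becomes a new labelled eval case."""
    eval_suite.append({"input": task, "expected": corrected_output,
                       "source": "hitl_correction"})
```

The random sample matters: without it, the eval suite only grows in areas where the model already knows it is unsure, and over-confident failures never get caught.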
Why this matters for evals
Every human correction generates a high-quality test case. The input is real production data. The expected output is expert-validated. Over time, the HITL workflow becomes a data generation engine that continuously expands and strengthens your eval suite.
This creates a compounding loop: the eval suite identifies weak spots, HITL workflows generate corrections for those weak spots, the corrections become new eval cases, and the next round of development targets the updated suite. Each cycle makes the product measurably better.
The teams that operationalise this loop build an increasingly deep moat. Their eval suites grow richer with every deployment. Their models improve passively through usage. Competitors starting from scratch face a dataset and eval gap that widens with every month.
Evals as competitive advantage
When a new model drops (and they drop every few weeks), teams without evals face weeks of manual testing. Is the new model safe to adopt? Does it handle our specific use cases? Did it break anything?
Teams with comprehensive eval suites answer these questions in hours. Run the suite against the new model, compare results to the quality bar, and make a data-driven decision. They adopt improvements faster, route tasks to better models sooner, and adjust their multi-model orchestration with confidence.
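The model-adoption decision reduces to running the same suite against each candidate and comparing scores to the quality bar. A sketch with a stubbed runner (the model names and bar are illustrative):

```python
def compare_models(suite, run_case, candidates, quality_bar=0.95):
    """Score each candidate on the same eval suite; pick the best that clears the bar."""
    scores = {}
    for model in candidates:
        passed = sum(run_case(model, case) for case in suite)
        scores[model] = passed / len(suite)
    eligible = {m: s for m, s in scores.items() if s >= quality_bar}
    best = max(eligible, key=eligible.get) if eligible else None
    return scores, best

suite = [{"id": i} for i in range(20)]
# Stub runner: pretend model-b passes all 20 cases, model-a passes 18.
results = {"model-a": 18, "model-b": 20}
counter = {"model-a": 0, "model-b": 0}
def run_case(model, case):
    counter[model] += 1
    return counter[model] <= results[model]

scores, best = compare_models(suite, run_case, ["model-a", "model-b"])
# scores == {"model-a": 0.9, "model-b": 1.0}; best == "model-b"
```

Returning `None` when nothing clears the bar is deliberate: "stay on the current model" is a valid, data-backed outcome of the comparison.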
The advantage extends beyond model adoption:
Faster prompt iteration. Every prompt change runs against the full suite before merging. Engineers experiment freely because the evals catch regressions automatically.
Clearer product decisions. Instead of debating in meetings whether the agent is "good enough," there are numbers. Before-and-after scores replace opinions.
Stronger vendor negotiation. When you can demonstrate that Model A outperforms Model B on your specific use cases with hard data, you negotiate from a position of strength.
Faster onboarding. New engineers understand the product's quality bar immediately by reading the eval suite. The evals document what "correct behaviour" means more precisely than any spec.
What eval-driven PMs look like
| Behaviour | In practice |
|---|---|
| Evals before features | Writes the eval criteria before the first line of feature code. Treats the eval suite as the acceptance criteria. |
| Failure-first thinking | Builds the eval suite from production failures, not imagined scenarios. Converts every bug report into a test case. |
| Multi-dimensional measurement | Tracks cost, latency, and efficiency alongside accuracy. Knows that a correct but expensive output is still a problem. |
| Regression-paranoid | Runs the full regression suite on every change. Investigates score drops immediately, regardless of how small. |
| HITL as data strategy | Designs human review workflows that generate labelled training data, not just quality checks. Sees every correction as an investment. |
The anti-pattern: the vibes-based launch
The team has been building an agent for three months. It works well in demos. The founder tested it last week and liked the results. A few engineers have run it through their favourite test cases manually. Everyone feels good about it.
They ship.
Within a week, support tickets surface failures nobody anticipated. A prompt fix for one issue breaks two others. The team can't tell which changes helped and which hurt, because there's no baseline to compare against. They enter a reactive cycle: patch, deploy, hope, repeat.
Six weeks later, a new model version drops. The team wants to adopt it but can't justify the risk. Manual testing would take two weeks. So they stay on the old model, falling behind competitors who upgraded in a day.
The cost of the vibes-based launch isn't the initial failures. It's the compounding inability to improve. Without evals, every change is a coin flip. With them, every change is a measured experiment. The gap between these two modes of operation widens with every deployment cycle.
Stop vibing. Start measuring.