Tag: Evals
5 articles tagged with Evals.

Tests Pass. Does It Think?
When AI writes the code, green CI isn't enough. The new discipline is understanding and defending the choices the model made, not just the ones you made.

How to Measure an AI Product (When Traditional Metrics Lie)
DAU, time-in-app, and NPS were built for a world where humans do the work. AI products need different metrics. A framework for what to measure and why.

Your Agent Evals Are Vibes. Here's How to Make Them Infrastructure.
Most teams evaluate agents with manual chats and gut feel. A practical framework for eval suites that let you ship, starting with 20 examples, not 20,000.

Stop Building AI Agents. Start Building SOPs Wrapped in Code.
A 5-step agent at 95% accuracy per step is only about 77% reliable end to end (0.95^5 ≈ 0.77, assuming the steps fail independently). The path forward isn't better agents; it's narrower ones. Three rules for workflows that ship.

Stop Picking Winners in the Model Race. Build the Router Instead.
Building for a single model is technical debt with a short shelf life. The winning strategy is orchestration, evals, and governance, not leaderboard loyalty.