5 articles tagged Evals

When AI writes the code, green CI isn't enough. The new discipline is understanding and defending the choices the model made, not just the ones you made.

DAU, time-in-app, and NPS were built for a world where humans do the work. AI products need different metrics. A framework for what to measure and why.

Most teams evaluate agents with manual chats and gut feel. A practical framework for eval suites that let you ship, starting with 20 examples, not 20,000.

A 5-step agent at 95% accuracy per step completes the whole workflow only about 77% of the time (0.95^5 ≈ 0.77, if steps fail independently). The path forward isn't better agents; it's narrower ones. Three rules for workflows that ship.
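
The 77% figure is what you get by multiplying per-step success rates together. A minimal sketch of that arithmetic follows, assuming every step succeeds or fails independently of the others (an assumption; the blurb does not state it). The helper name `end_to_end_reliability` is hypothetical, chosen for illustration.

```python
def end_to_end_reliability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


# Compare workflows of different lengths at the same 95% per-step accuracy.
for steps in (1, 3, 5, 10):
    reliability = end_to_end_reliability(0.95, steps)
    print(f"{steps:>2} steps at 95% per step -> {reliability:.1%}")
```

Under the same assumption, a 3-step workflow lands near 86% and a 10-step one near 60%, which is one way to read the case for narrower agents: fewer chained steps means fewer chances to compound an error.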

Building for a single model is technical debt with a short shelf life. The winning strategy is orchestration, evals, and governance, not leaderboard loyalty.