Stop Building AI Agents. Start Building SOPs Wrapped in Code.

TL;DR
- A 5-step agentic workflow at 95% accuracy per step is only 77% reliable, and the compounding math kills enterprise deployment
- The 2026 opportunity isn't general-purpose agents, it's narrow SOPs wrapped in code that target patience-heavy tasks, not judgment-heavy ones
- Prove one step works at 99% before you chain two together
In 2025, we were promised a workforce of autonomous digital interns. Instead, we got really impressive prototypes that broke the moment we took them to production.
I built some of those prototypes. I watched them work flawlessly in demos and fail spectacularly with real data. The gap between "this is amazing" and "this is reliable" turned out to be wider than anyone wanted to admit.
So is 2026 going to be the year agents finally take over?
No. And if you're waiting for an agent you can hand a vague objective ("Go increase my market share") and check back in a week, you're going to be waiting a long time. Possibly until 2035.
But that doesn't mean agents are useless. It means we've been building them wrong.
The 95% Trap
This is the math that kills most agentic ambitions. It's simple, and it's brutal.
Imagine you have an agentic workflow with five steps:
- Retrieve data
- Summarise data
- Check against policy
- Format report
- Email stakeholder
If your model is 95% accurate at each step (which is unusually high for generative AI), you might think you have a 95% reliable system.
You don't.
You have a 0.95 × 0.95 × 0.95 × 0.95 × 0.95 system. That works out to roughly 77% reliability.
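The arithmetic is worth checking for yourself:

```python
# Per-step accuracy compounds multiplicatively across a chained workflow.
steps = 5
per_step_accuracy = 0.95

system_reliability = per_step_accuracy ** steps
print(f"{system_reliability:.2%}")  # → 77.38%
```

Add a sixth step at the same accuracy and you drop to roughly 74%. The chain always loses.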
In nearly one of every four runs, your "autonomous" agent produces something wrong. Maybe it misreads a data field. Maybe it applies the wrong policy. Maybe it emails the wrong stakeholder a report with a hallucinated number in it.
In an enterprise setting, a failure rate approaching 25% isn't automation. It's a liability. It's the kind of failure rate that gets the project killed and the team's AI budget redirected to "safer" initiatives.
This is why the 2025 agent wave stalled. Not because the models weren't capable. They're astonishingly capable at individual tasks. But when you chain capable steps together without accounting for compounding error, the system-level reliability drops below the threshold where anyone rational would deploy it.
The fix isn't waiting for models to get better. It's building systems that respect the math.

2026: The Year of the Boring Workflow
We tried to boil the ocean in 2025. We tried to build "General Purpose Employees": agents that could handle ambiguous objectives, navigate complex decision trees, and adapt to novel situations.
That was exciting. It was also premature.
In 2026, we need to stop building agents and start building Standard Operating Procedures wrapped in code. SOPs aren't glamorous, but they're the unit of work that maps to how enterprises actually operate.
An SOP has defined inputs, defined outputs, defined quality criteria, and defined exception handling. It's boring. It's also exactly the kind of constraint that makes AI systems reliable.
The shift is from "build an agent that can think" to "build a workflow that can execute a known procedure faster and more consistently than a human." Different goal. Different architecture. Different success rate. I catalogue the specific agentic AI patterns in the handbook, from single-step workers to manager-worker hierarchies.
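To make that concrete, here's a minimal sketch of an SOP wrapped in code. Every name here is illustrative rather than taken from a specific framework; the point is that inputs, outputs, quality criteria, and exception handling are all explicit:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch: one SOP step with defined inputs, outputs,
# quality criteria, and exception handling. All names are hypothetical.
@dataclass
class SOPStep:
    name: str
    run: Callable[[dict], dict]            # defined input -> defined output
    quality_check: Callable[[dict], bool]  # defined quality criteria

    def execute(self, payload: dict) -> dict:
        try:
            result = self.run(payload)
        except Exception as exc:           # defined exception handling
            return {"status": "escalate", "reason": str(exc)}
        if not self.quality_check(result):
            return {"status": "escalate", "reason": "quality check failed"}
        return {"status": "ok", "output": result}

# Example: a trivial step that normalises a field and checks it's non-empty.
step = SOPStep(
    name="normalise_vendor_name",
    run=lambda p: {"vendor": p["vendor"].strip().upper()},
    quality_check=lambda r: bool(r["vendor"]),
)
print(step.execute({"vendor": " acme corp "}))
```

Anything the step can't handle is escalated, not improvised. That's the SOP mindset: the exception path is designed up front, not discovered in production.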
Three rules for building agents that actually work
If you want to get started without burning cash or credibility, these are the lessons that stuck.
Rule 1: Don't replace the human. Remove the drudgery.
Stop looking for tasks that require judgment. Look for tasks that require patience.
Bad agent: "Analyse this P&L and tell me what to cut." This requires strategic context, organisational knowledge, political awareness, and business judgment. An AI can generate an answer. It can't generate a good answer without context it doesn't have.
Good agent: "Compare these 500 invoices against this purchase order list and flag the mismatches." This requires patience, attention to detail, and tolerance for repetitive comparison. A human doing this is bored by invoice 50 and making errors by invoice 200. An AI doesn't get bored.
The pattern is consistent. Tasks with high volume, clear rules, and low ambiguity are where agents deliver immediate, measurable value. Tasks that require reading the room, weighing competing priorities, or making judgment calls under uncertainty are where agents create risk.
Map your team's workflow. Find the patience-intensive tasks. Start there.
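A sketch of the deterministic half of that invoice task (field names are hypothetical): in practice a model handles the messy extraction from varied document formats, while the actual comparison stays in plain code.

```python
# Hypothetical sketch: flag invoices whose amounts don't match the PO list.
# Field names (po_number, amount) are illustrative.
def flag_mismatches(invoices: list[dict], purchase_orders: list[dict]) -> list[dict]:
    po_amounts = {po["po_number"]: po["amount"] for po in purchase_orders}
    flags = []
    for inv in invoices:
        expected = po_amounts.get(inv["po_number"])
        if expected is None:
            flags.append({**inv, "issue": "no matching PO"})
        elif inv["amount"] != expected:
            flags.append({**inv, "issue": f"amount differs from PO ({expected})"})
    return flags

invoices = [
    {"po_number": "PO-1", "amount": 100.0},
    {"po_number": "PO-2", "amount": 250.0},  # PO says 200.0
    {"po_number": "PO-9", "amount": 80.0},   # no such PO
]
pos = [{"po_number": "PO-1", "amount": 100.0}, {"po_number": "PO-2", "amount": 200.0}]
print(flag_mismatches(invoices, pos))
```

Notice how little "intelligence" the comparison needs. The AI earns its keep upstream, turning 500 inconsistent documents into clean rows; the judgment-free matching is just code, which never gets bored at invoice 200.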
Rule 2: The one-step rule
If you're just starting, do not chain ten steps together. Build a one-step agent.
Input → Transformation → Output → Human Review.
That's it. Prove you can get Step 1 to 99% reliability before you attempt Step 2. Prove Step 2 independently. Only then start chaining.
This feels painfully slow. It's also how you avoid the 95% Trap. Each step gets validated independently, with its own eval suite, its own error handling, and its own quality threshold. When you chain validated steps, you know exactly where failures occur and can address them surgically.
The teams I've seen succeed with agentic workflows all followed this pattern. The teams that failed almost universally tried to build the full chain first and debug it as a system. Debugging a five-step agent chain is like debugging spaghetti code: you can't isolate anything. I've written a practical framework for building eval suites that make this validation possible.
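One way to enforce that discipline is a simple gate: a step only graduates into a chain once it clears a reliability threshold on its own labelled eval set. This is a sketch under assumed names, not a prescribed framework:

```python
# Illustrative gate: a step earns its place in a chain only after clearing
# a reliability threshold on its own labelled eval set.
def step_accuracy(step, eval_cases):
    """eval_cases: (input, expected_output) pairs for this single step."""
    passed = sum(1 for inp, expected in eval_cases if step(inp) == expected)
    return passed / len(eval_cases)

def ready_to_chain(step, eval_cases, threshold=0.99):
    return step_accuracy(step, eval_cases) >= threshold

# Toy example: a formatting step evaluated against known-good outputs.
fmt = lambda x: x.strip().lower()
cases = [("  Foo ", "foo"), ("BAR", "bar"), ("baz", "baz")]
print(step_accuracy(fmt, cases))  # → 1.0
```

In a real system `step` would wrap a model call and the eval set would be hundreds of cases, but the gate itself stays this simple: no 99%, no chaining.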
Rule 3: Narrow the context
The biggest killer of agents is what I call "variable drift." The more freedom you give the model, the more rope it has to hang itself.
Don't give it the whole internet. Give it one PDF and one specific schema to fill out. Don't give it open-ended instructions. Give it a template with fields to complete. Don't let it decide how to format the output. Define the format explicitly.
Constraint is clarity. Every degree of freedom you remove from the agent's decision space is a failure mode you've eliminated. The best-performing agents I've built are the ones that would look almost disappointingly simple to a demo audience. They do one thing, with one data source, to one output format. And they do it at 99%+ reliability, thousands of times a day.
That's not exciting. It's profitable. I'll take it.
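A minimal version of this kind of constraint, assuming a hypothetical three-field schema: the model's raw output is parsed against an explicit field list, and anything extra, missing, or mistyped is rejected rather than accommodated.

```python
import json

# Illustrative constraint: the model must return exactly these fields, in
# this shape. The schema is hypothetical; the pattern is the point.
EXPECTED_FIELDS = {"invoice_number": str, "total": float, "currency": str}

def parse_constrained_output(raw: str) -> dict:
    data = json.loads(raw)
    extra = set(data) - set(EXPECTED_FIELDS)
    missing = set(EXPECTED_FIELDS) - set(data)
    if extra or missing:
        raise ValueError(f"schema mismatch: extra={extra}, missing={missing}")
    for field, typ in EXPECTED_FIELDS.items():
        if not isinstance(data[field], typ):
            raise ValueError(f"{field} should be {typ.__name__}")
    return data

print(parse_constrained_output('{"invoice_number": "INV-7", "total": 99.5, "currency": "AUD"}'))
```

Every field the agent can't invent is a hallucination it can't ship. Rejection routes to a human; acceptance means the output is exactly the shape downstream systems expect.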
The industrial phase
We are moving from the hype phase to the industrial phase of AI agents. Industrialisation is boring. It's about measurement, error logging, failure analysis, and rigorous model evals. It's about building monitoring dashboards instead of demo videos. It's about establishing quality thresholds before you deploy, not after something breaks.
But boring and profitable beats exciting and broken every day of the week. The organisations that embrace this mindset in 2026, treating agent development as an engineering discipline rather than an innovation showcase, are the ones that will actually ship.
The agents are coming. They're just going to look a lot less like digital employees and a lot more like really good automated SOPs. And that's fine. That's where the value is. The next step is scaling this into multi-agent hierarchies, but only after you've proven each step independently.
Frequently Asked Questions
If agents are unreliable, why not just use traditional automation?
Traditional automation (RPA, scripts, rules engines) works well for fully deterministic processes. Agents add value where the task has some variability (slightly different document formats, natural language inputs, fuzzy matching) but still follows a known procedure. The sweet spot for agents is "structured enough to have clear success criteria, variable enough that hard-coded rules can't handle it." That's a large category of enterprise work.
How do you measure agent reliability in production?
Treat it like any other system. Define your success criteria before deployment: what does a correct output look like? Build automated checks where possible (schema validation, business rule checks) and sample-based human review where automated checks aren't feasible. Track accuracy per step, not just at the system level. The 95% Trap means system-level metrics hide step-level problems.
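A minimal sketch of what per-step tracking can look like (names are illustrative):

```python
from collections import defaultdict

# Illustrative per-step tracker: log pass/fail per step so a system-level
# number can be decomposed into the step that's actually failing.
class StepMetrics:
    def __init__(self):
        self.counts = defaultdict(lambda: {"pass": 0, "fail": 0})

    def record(self, step_name: str, passed: bool):
        self.counts[step_name]["pass" if passed else "fail"] += 1

    def accuracy(self, step_name: str) -> float:
        c = self.counts[step_name]
        total = c["pass"] + c["fail"]
        return c["pass"] / total if total else 0.0

m = StepMetrics()
for ok in [True, True, True, False]:
    m.record("check_policy", ok)
print(m.accuracy("check_policy"))  # → 0.75
```

A system-level dashboard showing 77% tells you something is wrong; a per-step breakdown showing "check_policy" at 75% tells you what to fix.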
When will general-purpose agents actually work?
The honest answer is: not soon. General-purpose agency requires reasoning, planning, error recovery, and contextual judgment that current models approximate but don't reliably deliver. We'll get there incrementally. Each year, agents will handle slightly more complex tasks with slightly more autonomy. But the "give it a goal and walk away" vision is years out. Build for what works now, not what might work eventually.
Logan Lincoln
Product executive and AI builder based in Brisbane, Australia. Nine years in regulated B2B SaaS, currently shipping production AI platforms.