The Bitter Lesson Kills Your Orchestration Layer
TL;DR
- Rich Sutton's bitter lesson (general methods outperform engineered solutions as compute scales) applies directly to AI product architecture
- Elaborate orchestration layers that scaffold model behaviour deliver 10-20% performance gains, and the next model release typically wipes them out
- The winning architecture is minimal: give the model tools, a goal, and constraints, then get out of the way
Rich Sutton published "The Bitter Lesson" in 2019. His argument was simple: across 70 years of AI research, general methods that scale with computation have consistently outperformed domain-specific engineered solutions. Chess engines, speech recognition, computer vision: the teams that built clever heuristics kept losing to the teams that threw more compute at simpler algorithms.
The lesson was bitter because the clever engineering felt right. It represented deep domain knowledge, careful design, and genuine insight. And it kept losing to brute force.
Product builders are learning the same lesson right now with orchestration layers. The engineering feels right. The results are temporary.
The orchestration instinct
When product teams build with LLMs, the instinct is to constrain. The model is unpredictable, so you build scaffolding to make it predictable.
Step-by-step workflows. Mandatory tool sequences. Output validators. Retry chains. Classification layers that route requests to specialised prompts. Manager-worker hierarchies with explicit handoff protocols. I wrote about some of these patterns myself in agentic AI architectures.
These patterns work. A well-designed orchestration layer can improve task completion rates by 10-20% compared to a bare model. For production systems with real users and real consequences, that margin matters.
The problem is durability. Every orchestration layer I've built has had a shelf life of three to six months before a model upgrade made it partially or fully obsolete.
What I saw in production
When I built the AI voice receptionist for OpenChair, I initially designed an explicit conversation flow. The model would greet, ask for the service type, check availability, confirm the booking, and summarise. Each step had a specific prompt. The transitions were managed by application logic.
This worked well with the model I launched on. It worked less well two months later when a new model version handled the full conversation naturally without needing the step-by-step scaffolding. The explicit transitions were now introducing unnecessary pauses and occasionally conflicting with the model's own conversation management. I spent a day ripping out the orchestration layer and replacing it with a single system prompt that described the goal and constraints.
The simplified version performed better. Fewer dropped calls. More natural conversations. Lower latency. Less code to maintain.
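The shape of that change is easy to show in code. Here's a hedged sketch of the before and after; the step names, prompts, and function names are illustrative, not the actual OpenChair implementation:

```python
# BEFORE: application logic drives the conversation through fixed steps,
# each with its own prompt. The transitions belong to the code, not the model.
STEP_PROMPTS = {
    "greet": "Greet the caller and ask how you can help.",
    "service": "Ask which service the caller would like to book.",
    "availability": "Check availability and offer the nearest slots.",
    "confirm": "Confirm the booking details back to the caller.",
    "summarise": "Summarise the booking and end the call politely.",
}
STEP_ORDER = ["greet", "service", "availability", "confirm", "summarise"]

def next_step(current: str) -> str:
    """Transition logic owned by the application, not the model."""
    i = STEP_ORDER.index(current)
    return STEP_ORDER[min(i + 1, len(STEP_ORDER) - 1)]

# AFTER: one system prompt describing the goal and constraints.
# The model owns the conversation flow.
SYSTEM_PROMPT = (
    "You are a voice receptionist. Help the caller book an appointment: "
    "confirm the service, check availability, book the slot, and send "
    "confirmation. Never double-book. Always confirm the caller's phone number."
)
```

The "after" version is shorter, but the real win is what's absent: no transition bugs, no artificial pauses between steps, nothing to rearchitect when the model improves.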
I saw the same pattern with the multi-agent audit architecture. I built a manager-worker system where a manager model reviewed every worker output for quality. That architecture made economic sense when worker models were unreliable. When the worker model improved to the point where it produced correct output 97% of the time, the manager review became a 2,500% cost increase for a 3% quality improvement. The maths stopped working.
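The break-even point is simple arithmetic. A sketch with assumed relative prices (the 25x manager cost is chosen to match the 2,500% figure above; these aren't my actual bills):

```python
def cost_per_correct(unit_cost: float, accuracy: float) -> float:
    """Expected spend to obtain one correct output,
    assuming failed outputs are simply retried."""
    return unit_cost / accuracy

worker_cost = 1.0         # relative cost of one worker call (assumed)
manager_cost = 25.0       # relative cost of one manager review (assumed)
worker_accuracy = 0.97    # worker alone, from production measurement
reviewed_accuracy = 1.00  # generously assume the manager catches every error

bare = cost_per_correct(worker_cost, worker_accuracy)
reviewed = cost_per_correct(worker_cost + manager_cost, reviewed_accuracy)

print(f"bare worker:  {bare:.2f} per correct output")
print(f"with manager: {reviewed:.2f} per correct output")
print(f"multiplier:   {reviewed / bare:.1f}x for +3 points of accuracy")
```

Even under the most generous assumption about the manager, the review step multiplies cost by roughly 25x for a 3-point gain. The moment the worker crossed ~97%, the architecture was economically dead.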
The bitter lesson applied: the general model got better. My specific engineering became overhead.
When to scaffold and when to trust
I'm not arguing that all orchestration is waste. Some scaffolding is durable because it serves purposes that model improvement doesn't address:
Durable scaffolding:
- Guardrails and safety constraints. Preventing the model from taking destructive actions, accessing sensitive data without authorisation, or operating outside compliance boundaries. These exist because of governance requirements, not model limitations.
- Observability and audit trails. Logging tool calls, recording decisions, maintaining trace data for debugging and compliance. The model getting smarter doesn't eliminate the need for accountability.
- Cost controls. Token budgets, request rate limiting, model routing based on task complexity. These exist because of economic constraints, not capability gaps.
- Human-in-the-loop checkpoints. Review gates at high-consequence decision points. These exist because of risk tolerance, not model quality.
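Durable scaffolding like this tends to live at the tool boundary, not in the conversation flow. A minimal sketch of an allow-list plus audit trail wrapped around tool calls; the tool names and policy are hypothetical:

```python
import time
from typing import Any, Callable

# Governance policy: which tools the model may invoke (hypothetical names).
ALLOWED_TOOLS = {"check_availability", "book_slot"}

def guarded_call(tool: Callable[..., Any], name: str,
                 audit_log: list, **kwargs) -> Any:
    """Enforce the allow-list and record every call.
    This survives model upgrades: it exists for governance, not capability."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not permitted")
    result = tool(**kwargs)
    audit_log.append({"ts": time.time(), "tool": name,
                      "args": kwargs, "result": repr(result)})
    return result

# Usage: wrap every tool invocation the model requests.
log: list = []
slot = guarded_call(lambda service: "10:30", "check_availability",
                    log, service="haircut")
```

Note that nothing here tells the model *when* to call which tool. The scaffolding constrains and records; it doesn't orchestrate.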
Temporary scaffolding (the kind the bitter lesson kills):
- Step-by-step workflow enforcement. ("You must do A, then B, then C.")
- Output format validation and retry loops. ("If the JSON is malformed, re-prompt with stricter instructions.")
- Task decomposition logic. ("Break this task into subtasks and solve each separately.")
- Classification layers that route to specialised prompts. ("If the query is about billing, use prompt X; if about support, use prompt Y.")
The test is straightforward: does this scaffolding exist because of a model limitation or because of a business requirement? If it's a model limitation, assume the model will fix it. Build the scaffolding if you need it today, but design it to be removable.
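One way to make temporary scaffolding removable by design is to gate it behind a flag, so that retiring it when the model graduates is a config change rather than a rewrite. A sketch of a format-validation retry loop built this way (the function names are illustrative):

```python
import json

# Temporary scaffolding: flip this off once the model reliably emits JSON.
STRICT_JSON_RETRY = True

def get_structured_output(prompt: str, call_model, max_retries: int = 2) -> dict:
    """Ask the model for JSON; retry with stricter instructions
    only while the scaffolding flag is on."""
    reply = call_model(prompt)
    for _ in range(max_retries if STRICT_JSON_RETRY else 0):
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            reply = call_model(prompt + "\nReturn ONLY valid JSON, no prose.")
    # Final attempt: if this fails, let it surface in your evals.
    return json.loads(reply)
```

When the flag comes off, the function collapses to a single model call: the scaffolding was there, it was useful, and it left no architectural residue.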
The SOP resolution
I wrote earlier about building SOPs wrapped in code as the path to reliable agentic AI. That argument still holds, but the bitter lesson refines it.
SOPs should define outcomes and constraints, not procedures.
A good SOP says: "When a customer calls to book an appointment, confirm the service, check availability, book the slot, and send confirmation. Never double-book. Always confirm the customer's phone number." That SOP is durable because it describes what the agent should accomplish and what it must not do.
A bad SOP says: "Step 1: Greet the customer with 'Hello, how can I help you today?' Step 2: Ask 'What service would you like to book?' Step 3: Query the availability API with parameters..." That SOP is temporary because it prescribes how the agent should work at a level of detail that the model will eventually handle better on its own.
The resolution: define the goal, the tools, and the guardrails. Let the model figure out the procedure. As the model improves, the procedure improves automatically. Your product gets better without a code change.
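The same distinction shows up in how you encode the SOP. The durable version checks the *outcome* against the hard constraints after the fact, without scripting the steps. A sketch, with illustrative booking fields:

```python
def check_sop_constraints(booking: dict, existing_bookings: list) -> list:
    """Verify the outcome against the SOP's hard constraints.
    Nothing here prescribes how the model produced the booking."""
    violations = []
    if any(b["slot"] == booking["slot"] for b in existing_bookings):
        violations.append("double-booked slot")
    if not booking.get("phone_confirmed"):
        violations.append("phone number not confirmed")
    if not booking.get("confirmation_sent"):
        violations.append("no confirmation sent")
    return violations

existing = [{"slot": "2025-06-01T10:00"}]
result = check_sop_constraints(
    {"slot": "2025-06-01T11:00", "phone_confirmed": True,
     "confirmation_sent": True},
    existing,
)
```

The model can change its greeting, its question order, even its whole conversation strategy, and this check never needs to change, because it encodes the "never" and "always" clauses, not the procedure.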
The practical architecture
If you're designing an AI product architecture today with the bitter lesson in mind, here's the structure:
Goal (what the model should accomplish)
↓
Tools (what capabilities the model has access to)
↓
Constraints (what the model must not do)
↓
Evals (how you verify the model did it correctly)
That's it. No workflow engine. No state machine. No orchestration layer. The model receives a goal, has access to tools, operates within constraints, and its output is verified by evals.
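A hedged sketch of what that loop looks like in code, with the model and tools stubbed out; every name here is an assumption for illustration, not a real API:

```python
def run_agent(goal: str, tools: dict, constraints, call_model, evals,
              max_turns: int = 10):
    """Minimal agent loop: the model picks its own actions; the code only
    enforces constraints and verifies the result. No workflow engine."""
    history = [f"GOAL: {goal}"]
    for _ in range(max_turns):
        # The model returns either a tool request or a final answer,
        # e.g. {"tool": ..., "args": ...} or {"done": result}.
        action = call_model(history)
        if "done" in action:
            result = action["done"]
            return result, evals(result)  # verification, not orchestration
        if not constraints(action):
            history.append("BLOCKED: action violates constraints")
            continue
        out = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']} -> {out}")
    raise TimeoutError("agent did not finish within max_turns")
```

Notice what's missing: there's no routing logic, no step enforcement, no task decomposition. Those decisions belong to the model, which means they improve every time the model does.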
When the model is good enough, this architecture produces excellent results with minimal engineering overhead. When the model isn't good enough yet, you add temporary scaffolding at the point of failure, knowing that you'll remove it when the model catches up.
The teams that build this way ship faster, maintain less code, and automatically benefit from model improvements. The teams that build elaborate orchestration layers ship slower, maintain more code, and find that each model upgrade requires rearchitecting their scaffolding.
The bitter lesson isn't just about AI research. It's about AI product engineering. General beats specific. Simple beats clever. And the model always catches up.
Frequently Asked Questions
Doesn't this approach sacrifice reliability for simplicity?
In the short term, sometimes yes. A well-tuned orchestration layer can outperform a bare model for specific tasks. The question is whether that performance gap is durable. If the next model closes the gap (and historically, it does), the orchestration layer becomes a liability. Use evals to measure the gap continuously and remove scaffolding as the model graduates.
How does this apply to regulated environments where consistency matters?
In regulated environments, the durable scaffolding (guardrails, audit trails, human review gates) matters more, not less. What changes is the temporary scaffolding. A bank might require that every AI-generated valuation be logged and reviewable. That's a governance requirement and it's permanent. But the bank shouldn't hard-code the valuation calculation procedure, because the model's ability to perform that calculation will improve.
What if my team has already built an elaborate orchestration layer?
Don't rip it out overnight. Instead, add monitoring that tracks whether each piece of scaffolding is still improving outcomes. When a component stops adding value (because the model has caught up), remove it. Treat scaffolding like feature flags: designed to be turned off.
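That monitoring can be as simple as tracking success rates with each scaffold on versus off (via an A/B split or periodic holdout). A sketch, with illustrative names:

```python
from collections import defaultdict

class ScaffoldMonitor:
    """Track outcomes with each piece of scaffolding enabled vs disabled,
    so you can see when the model has caught up and the scaffold can go."""

    def __init__(self):
        # scaffold name -> {"on": [successes, trials], "off": [successes, trials]}
        self.stats = defaultdict(lambda: {"on": [0, 0], "off": [0, 0]})

    def record(self, scaffold: str, enabled: bool, success: bool) -> None:
        arm = self.stats[scaffold]["on" if enabled else "off"]
        arm[0] += int(success)
        arm[1] += 1

    def lift(self, scaffold: str) -> float:
        """Success-rate improvement attributable to the scaffold.
        When this approaches zero, the scaffold is pure overhead."""
        on, off = self.stats[scaffold]["on"], self.stats[scaffold]["off"]
        rate = lambda arm: arm[0] / arm[1] if arm[1] else 0.0
        return rate(on) - rate(off)

monitor = ScaffoldMonitor()
monitor.record("json_retry", enabled=True, success=True)
monitor.record("json_retry", enabled=True, success=True)
monitor.record("json_retry", enabled=False, success=True)
monitor.record("json_retry", enabled=False, success=False)
```

A scaffold with a lift near zero has been graduated by the model: turn the flag off, watch the metrics for a release cycle, then delete the code.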
Logan Lincoln
Product executive and AI builder based in Brisbane, Australia. Nine years in regulated B2B SaaS, currently shipping production AI platforms.