Multi-Model Orchestration and the Routing Layer
Why no single model wins every task, how the routing layer becomes your competitive advantage, and the worker-manager pattern for multi-agent systems.
TL;DR
- The gap between top models is narrowing on broad benchmarks, but specialisations are diverging. Hard-coding to a single provider is technical debt that compounds with every new release.
- Your routing layer and eval framework are your IP. Model providers ship better models. They can't ship a better understanding of your users' needs.
- Small cheap models for execution, large reasoning models for oversight. The worker-manager pattern can cut inference costs by 75% or more when paired with selective auditing.
The model race looks like convergence. GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Llama 4, DeepSeek V3, Mistral Large: all within a few points of each other on MMLU, HumanEval, and the broad reasoning benchmarks. If you squint at the leaderboards, they're interchangeable.
They're not.
Each model family carries distinct strengths. Claude Opus 4.6 excels at sustained agentic work, long-context synthesis, and complex multi-step reasoning across entire workflows. GPT-5.4 unifies coding and general reasoning with a 1M+ token context window and strong tool orchestration. Gemini 3.1 Pro handles massive multimodal context across text, image, video, and audio with improved agentic reliability. Llama and DeepSeek offer self-hosting economics that closed models can't match. Mistral ships models tuned for European language coverage and regulatory environments.
The product question is not "which model is best." It's "which model is best for this specific task, at this cost, at this latency, right now."
The commoditisation thesis
Model capabilities are commoditising. Prices are falling 10x per year. Last year's frontier model is this year's mid-tier offering, and next year it'll be the cheap option.
This creates a strategic reality that most product teams get wrong: the model is not your moat. If your product's value depends on access to a specific model, you have no defensible advantage. Every competitor has the same access, at the same price, through the same API.
What doesn't commoditise: your understanding of your users' tasks, your eval suite that measures quality on those tasks, and your routing logic that matches tasks to models. These are proprietary. They improve with usage. They compound over time.
The teams that win are not the ones with the best model. They're the ones with the best system for selecting, combining, and evaluating models against their specific workloads.
The routing layer
The routing layer sits between your application and your model providers. It inspects each request, classifies the task, and routes it to the optimal model based on four dimensions: quality requirements, latency constraints, cost budget, and capability match.
How it works
A request arrives. The router classifies it. Classification can be rule-based (regex, keyword matching, task type headers), model-based (a small classifier that predicts the best model for the input), or hybrid.
Once classified, the router selects a model. Selection logic typically follows a priority cascade:
- Does this task type have a hard requirement? (Some tasks need a specific model for compliance, context window, or multimodal capability.)
- Which models meet the latency SLA for this request?
- Among qualifying models, which has the best eval score for this task type?
- Among top performers, which is cheapest?
The router also handles failover. If the primary model is down or slow, it reroutes to the next-best option. If a response fails quality checks, it can retry with a more capable model. This cascade is invisible to the user.
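A minimal sketch of that cascade, with illustrative model names, latencies, costs, and eval scores (real values come from your own telemetry and eval suite, not from this example):

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    p95_latency_ms: int    # measured on your workload, not provider benchmarks
    cost_per_mtok: float   # blended cost, USD per million tokens
    eval_scores: dict      # task_type -> score on your own eval suite

# Illustrative numbers only -- real profiles come from production telemetry.
MODELS = [
    ModelProfile("frontier-large", 2400, 15.00, {"policy": 0.94, "lookup": 0.97}),
    ModelProfile("mid-tier",        900,  3.00, {"policy": 0.88, "lookup": 0.96}),
    ModelProfile("small-fast",      300,  0.40, {"policy": 0.71, "lookup": 0.95}),
]

# e.g. a compliance-mandated model per task type; None means no hard requirement
HARD_REQUIREMENTS = {"policy": None}

def route(task_type: str, latency_sla_ms: int, quality_floor: float = 0.9) -> str:
    """Priority cascade: hard requirement -> latency SLA -> eval score -> cost."""
    required = HARD_REQUIREMENTS.get(task_type)
    if required:
        return required
    # 1. Filter to models that meet the latency SLA for this request.
    candidates = [m for m in MODELS if m.p95_latency_ms <= latency_sla_ms]
    # 2. Keep models above the quality floor for this task type.
    qualified = [m for m in candidates if m.eval_scores.get(task_type, 0) >= quality_floor]
    if not qualified:  # failover: degrade to the best available option
        qualified = candidates or MODELS
    # 3. Among top performers (within 2 points of the best), pick the cheapest.
    best = max(m.eval_scores.get(task_type, 0) for m in qualified)
    top = [m for m in qualified if m.eval_scores.get(task_type, 0) >= best - 0.02]
    return min(top, key=lambda m: m.cost_per_mtok).name
```

With these toy numbers, a latency-sensitive lookup routes to the cheap model, while a policy question that no fast model handles well degrades gracefully to the best qualifying option rather than failing.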
What makes the routing layer IP
Three things:
Task taxonomy. Your classification of what types of work your system handles, and how each type maps to model strengths. This comes from observing real usage, not from guessing. A customer support product might distinguish between "simple lookup," "policy interpretation," "emotional de-escalation," and "technical troubleshooting," each routing to a different model or configuration.
Quality baselines per route. Your eval suite, broken down by task type, tells you which model performs best for each route. These baselines shift as models improve. The routing layer should re-evaluate periodically as new models and versions drop.
Cost and latency profiles. Real-world performance data, not provider benchmarks, for each model on your actual workload. Provider-quoted latency and token costs rarely match production behaviour under load.
No model provider can replicate this. They don't know your task taxonomy. They don't have your eval data. They don't understand your users' tolerance for latency versus quality tradeoffs. This is your competitive advantage.
The worker-manager pattern
Multi-model orchestration is not just about routing different requests to different models. It's also about using multiple models within a single workflow.
The worker-manager pattern assigns roles by cost and capability:
Workers are small, cheap, fast models. They handle execution tasks: formatting output, extracting structured data from text, simple classification, template filling, data validation. These tasks have clear right answers and don't require complex reasoning. Workers run at a fraction of the cost of frontier models, often 10-20x cheaper per token.
Managers are large reasoning models. They handle planning, complex judgment, ambiguity resolution, and quality oversight. The manager decomposes a complex task into worker-sized subtasks, dispatches them, and reviews the results.
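Structurally, the pattern is a plan/execute/review loop. In this sketch the model calls are stubbed with placeholder functions (`plan`, `execute`, and `review` are illustrative names, not any provider's API); in production each would be an inference call to the appropriate tier:

```python
def plan(task: str) -> list[str]:
    """Manager role: a large reasoning model decomposes the task (stubbed)."""
    return [f"{task} / part {i}" for i in range(1, 6)]

def execute(subtask: str) -> str:
    """Worker role: a small, cheap model handles one bounded subtask (stubbed)."""
    return subtask.upper()

def review(output: str) -> str:
    """Manager role again: audit and, if needed, correct a worker output (stubbed)."""
    return output

def run(task: str) -> list[str]:
    # Manager plans, workers execute the subtasks, manager reviews every result.
    # Reviewing everything is the naive version -- see the audit tax below.
    return [review(execute(subtask)) for subtask in plan(task)]
```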
The audit tax
The naive implementation has the manager review every worker output. This doubles the inference cost at minimum, and usually far more, because the manager pays frontier-model token prices to re-read each worker's context and output. For a workflow with five parallel workers, the manager's audit pass can inflate total cost by 2,500% over the base worker cost alone.
The fix is selective auditing. Not every worker output needs manager review.
Confidence-based routing. Workers report a confidence score alongside their output. High-confidence outputs (above a calibrated threshold) pass directly to the next stage. Low-confidence outputs get routed to the manager for review.
Statistical sampling. Even for high-confidence outputs, the manager reviews a random sample to catch systematic errors the confidence score might miss.
Graduated trust. New workers or new task types start with 100% audit. As reliability is demonstrated through evals, the audit rate decreases. If quality dips, the audit rate increases automatically.
At 80% high-confidence pass-through with 10% sampling of the pass-through, the manager reviews only 28% of outputs (the 20% low-confidence tail plus the sample), so the blended audit cost drops by roughly 70-75% compared to full review. The quality impact is negligible if your confidence calibration is sound.
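A sketch of confidence-based routing and sampling together, with the 80/10 worked example made explicit; the threshold, sampling rate, and seed are illustrative:

```python
import random

def audit_rate(pass_through: float, sample_rate: float) -> float:
    """Fraction of worker outputs the manager actually reviews.

    Low-confidence outputs (1 - pass_through) always get full review;
    a sample_rate slice of the high-confidence rest is spot-checked."""
    return (1 - pass_through) + pass_through * sample_rate

def audit(outputs, confidence_threshold=0.85, sample_rate=0.10, rng=None):
    """Route each (output, confidence) pair to the manager or straight through."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility in this sketch
    routed = []
    for output, confidence in outputs:
        if confidence < confidence_threshold or rng.random() < sample_rate:
            routed.append((output, "manager"))  # full manager review
        else:
            routed.append((output, "pass"))     # straight to the next stage
    return routed

# The worked example: 80% pass-through, 10% sampling -> 28% audit rate.
rate = audit_rate(pass_through=0.80, sample_rate=0.10)
savings = 1 - rate  # 0.72 -- roughly the 70-75% cut versus full review
```

Graduated trust would wrap this: raise or lower `sample_rate` per worker and task type as eval results accumulate, rather than hard-coding it.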
MCP and tool ecosystems
Model Context Protocol (MCP) standardises how models connect to external tools and data sources. Instead of building custom integrations for each model provider, you expose capabilities through MCP servers that any compliant model can call.
This matters for multi-model orchestration because it decouples tool access from model selection. Your routing layer can swap models without rebuilding integrations. A task that ran on Claude yesterday can run on GPT-5.4 today using the same MCP servers.
PM decisions for MCP
Which integrations to expose. Every MCP server you stand up is attack surface, maintenance burden, and a potential source of latency. Expose what the product needs. Resist the temptation to connect everything because you can.
Data source access controls. MCP servers inherit whatever permissions you grant them. A model with access to your customer database through MCP has access to your customer database. Design permissions as carefully as you would for a new hire, because the blast radius of a hallucinated tool call is bounded by the permissions you set.
Tool descriptions. Models select tools based on their descriptions. Poorly described tools get misused. Invest time in writing precise, unambiguous tool descriptions with clear parameter documentation. This is prompt engineering for tool selection, and it matters as much as the system prompt.
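For instance, a precisely described tool might look like the following. The shape follows the MCP convention of `name`, `description`, and a JSON-Schema `inputSchema`; the tool itself (`lookup_order`) and its wording are hypothetical:

```python
# A hypothetical MCP-style tool definition. Note the description says both
# when to use the tool and when NOT to -- that negative guidance is what
# keeps models from misusing it.
lookup_order_tool = {
    "name": "lookup_order",
    "description": (
        "Fetch a single order by its order ID. Use this when the user "
        "references a specific order. Do NOT use it to search or list "
        "orders; it returns exactly one record or an error."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Exact order ID, e.g. 'ORD-2024-001234'. "
                               "Never a customer name or email address.",
            },
        },
        "required": ["order_id"],
    },
}
```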
Versioning and deprecation. When you change a tool's behaviour, every model that uses it is affected. Version your MCP servers. Run old and new versions in parallel during transitions. Monitor tool call patterns for unexpected changes after updates.
Cost engineering
Multi-model orchestration creates cost levers that single-model architectures lack. Three techniques matter most.
Prompt caching
Most model providers now support prompt caching: the system prompt and any repeated context prefix are cached and reused across requests. For applications with long system prompts, retrieval-augmented context, or shared conversation prefixes, this reduces input token costs by up to 90%.
The PM implication: prompt caching rewards stable, front-loaded context. If your system prompt changes on every request, caching provides no benefit. If you can structure your prompts so that the first 80% is identical across requests and the variable portion sits at the end, caching saves real money at scale.
Design your prompt architecture with caching in mind. It's not just an engineering optimisation. It's a product decision about how context is structured.
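A sketch of the principle, assuming a generic chat-style message list (the exact cache-control knobs vary by provider), plus an illustrative cost model in which cached tokens are billed at 10% of list price:

```python
def build_prompt(system_prompt: str, shared_context: str, user_input: str) -> list[dict]:
    """Keep the stable prefix first and the variable part last --
    the layout that provider prompt caches reward."""
    return [
        {"role": "system", "content": system_prompt},  # stable: cached after first call
        {"role": "user", "content": shared_context},   # stable: shared/retrieved prefix
        {"role": "user", "content": user_input},       # variable: never breaks the prefix
    ]

def cached_input_cost(total_tokens: int, cached_fraction: float,
                      price_per_token: float, cache_discount: float = 0.9) -> float:
    """Input cost when cached tokens bill at (1 - cache_discount) of list price.

    The 90% discount is illustrative; check your provider's actual rates."""
    cached = total_tokens * cached_fraction
    fresh = total_tokens - cached
    return (cached * (1 - cache_discount) + fresh) * price_per_token
```

With 80% of a 1,000-token prompt cached at a 90% discount, input cost falls from 100% to 28% of the uncached price, which is where the "up to 90%" headline number comes from as the cached fraction approaches 1.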
Speculative decoding
Speculative decoding uses a small, fast model to generate draft tokens, then a large model to verify them in a single forward pass. The large model accepts correct tokens (which is most of them for straightforward text) and only regenerates where the draft was wrong.
The result: near-large-model quality at near-small-model latency. The cost profile is more nuanced, since you're running two models, but the latency improvement can be dramatic for user-facing features where time-to-first-token matters.
Not every provider exposes speculative decoding controls. Where they do, it's most effective for tasks with predictable output patterns: structured data, templated responses, code generation with established conventions.
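A simplified expectation model shows why acceptance rate is the lever that matters. This follows the standard speculative-decoding analysis and assumes each draft token is accepted independently with the same probability:

```python
def expected_tokens_per_verify(alpha: float, gamma: int) -> float:
    """Expected tokens produced per large-model verification pass.

    alpha: probability each draft token is accepted (assumed independent).
    gamma: draft tokens proposed per pass. A pass yields at least 1 token
    (the large model's own, if the first draft is rejected) and at most
    gamma + 1 (all drafts accepted plus one bonus token)."""
    if alpha == 1.0:
        return gamma + 1.0
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# At an 80% acceptance rate with 4 draft tokens per pass, each expensive
# verification step yields ~3.36 tokens instead of 1 -- which is why
# predictable output patterns (high alpha) benefit most.
tokens_per_pass = expected_tokens_per_verify(0.8, 4)
```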
Batching and async processing
Not every task needs a real-time response. Background processing, nightly batch jobs, and async workflows can use batch APIs at 50% or greater discounts. The tradeoff is latency (batch requests may take hours), but for data enrichment, content generation pipelines, and offline analysis, the economics are compelling.
The PM decision is classification: which tasks are genuinely latency-sensitive, and which just feel that way because nobody questioned the default? At a 50% discount, moving 30% of your inference volume to batch cuts your model spend by 15% with zero user-facing impact; shift the bulk of a pipeline-heavy workload and the saving approaches the full discount.
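The arithmetic is worth making explicit. A toy spend model, normalised to an all-realtime baseline of 1.0:

```python
def blended_spend(batch_fraction: float, batch_discount: float = 0.5) -> float:
    """Total spend relative to an all-realtime baseline of 1.0.

    batch_fraction: share of inference volume moved to the batch API.
    batch_discount: the batch API's discount off realtime pricing."""
    realtime = 1 - batch_fraction
    return realtime + batch_fraction * (1 - batch_discount)

# 30% of volume to a 50%-discounted batch API: spend falls to ~0.85, a 15% cut.
# Halving total spend requires moving essentially everything to batch.
spend = blended_spend(0.30)
```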
When to orchestrate and when not to
Multi-model orchestration adds complexity. Complexity has costs: more failure modes, harder debugging, more infrastructure to maintain, more vendor relationships to manage. Don't orchestrate for its own sake.
Orchestrate when
- Task diversity is high. Your product handles fundamentally different types of work (creative writing, code generation, data extraction, image analysis) that map to different model strengths.
- Cost pressure is real. You're spending enough on inference that the engineering investment in routing pays for itself through cheaper model allocation.
- Reliability requirements demand failover. A single-provider outage cannot be allowed to take your product offline.
- You have the eval infrastructure. Multi-model routing without evals is guessing. You need per-task-type quality measurement to know whether your routing decisions are correct.
Stay single-model when
- Your product does one thing. A focused product with a single task type gains little from orchestration overhead.
- Volume is low. Below a certain inference spend, the engineering cost of routing exceeds the savings from cheaper models.
- You can't measure quality per route. Without evals broken down by task type and model, you can't validate that your routing is improving outcomes. You'll just be adding complexity without evidence it helps.
- The integration burden outweighs the benefit. Different models have different prompt formats, tool calling conventions, output structures, and failure modes. Supporting multiple models means maintaining multiple prompt sets and handling multiple edge cases.
The decision framework is straightforward: will the total cost of building and maintaining multi-model orchestration (engineering time, infrastructure, testing, prompt maintenance) be less than the savings and quality gains it delivers? If the answer isn't clearly yes, wait. Single-model simplicity is underrated.
What multi-model PMs look like
| Behaviour | In practice |
|---|---|
| Provider-agnostic by default | Abstracts model calls behind a routing layer from the start. Never hard-codes provider-specific features without a documented exit path. |
| Cost-modelling before building | Models the inference economics of single-model vs. multi-model before committing to architecture. Knows the break-even point. |
| Eval-driven routing | Routes based on measured per-task quality, not intuition or marketing claims. Re-evaluates routing when new models drop. |
| Negotiation-ready | Uses multi-provider optionality as negotiation leverage. Can credibly switch providers because the architecture supports it. |
The anti-pattern: the single-model bet
The team builds everything on one provider. The system prompt is optimised for that model's quirks. The tool calling uses provider-specific schemas. The output parsing assumes provider-specific formatting. The cost model assumes current pricing will hold.
Then the provider raises prices by 40%. Or suffers a week of degraded performance. Or a competitor releases a model that's twice as fast at half the cost for the team's primary use case. Or the provider deprecates the model version the team depends on with 90 days' notice.
The team can't switch. Migrating means rewriting prompts, retesting every workflow, rebuilding tool integrations, and re-validating quality. It takes months. During those months, competitors who built provider-agnostic architectures have already moved.
The single-model bet feels simpler on day one. By month six, it's the most expensive decision the team made. Build the abstraction layer early. It costs a week of engineering. Skipping it costs quarters of lock-in.