The fastest way to cut your LLM bill in half is to stop routing every query through your most expensive model. Most enterprise AI pipelines use a single frontier model for everything — classification, extraction, summarization, generation — despite the fact that 60–80% of those tasks could be handled by a model costing 10–100× less.
LLM model routing is the technique that fixes this. A lightweight router inspects each incoming query, estimates its complexity, and sends it to the cheapest model capable of answering correctly. Complex or high-stakes queries go to frontier models; routine tasks go to smaller, cheaper alternatives. With a well-calibrated router, quality holds steady while costs drop.
Research from Stanford puts the waste at 50–90% of total inference spend (Chen et al., FrugalGPT, 2023). Model routing is the most direct way to capture it.
Definition
LLM model routing is the practice of automatically directing each query to the cheapest model capable of answering it correctly. A router classifies each request by difficulty or task type — before or during inference — and maps it to the optimal model from a portfolio. The goal is to pay only for the intelligence each task actually requires.
Why Routing Works: The Capability Gap Is Real
Frontier models are overqualified for most production tasks. Not every query needs the reasoning depth of GPT-4o or Claude Opus — and for tasks where a smaller model performs identically, using a frontier model is pure waste.
Consider what a typical enterprise AI pipeline actually processes:
- Intent classification — what category does this request fall into? (binary or N-way)
- Entity extraction — pull names, dates, amounts from unstructured text
- Template-based summarization — condense a document into a fixed structure
- Structured JSON generation — convert natural language into a schema
- FAQ and policy lookup responses — answer from a fixed knowledge base
- Complex reasoning and generation — nuanced responses, multi-step problems
The first five tasks are commodity inference. A 7B or 13B fine-tuned model handles them at equal quality on most workloads. Only the last — complex reasoning and free-form generation where quality variance matters — benefits from frontier intelligence. Routing identifies which category each query falls into and prices accordingly.
Three Routing Architectures
There are three proven approaches to LLM routing. Each makes different tradeoffs between implementation complexity, latency, and accuracy.
Confidence-Based Cascading
Try the cheapest model first. If its response confidence is above a threshold, return it. If not, escalate to a more capable model. Repeat up the capability ladder until confidence is met or the frontier model responds.
Advantages: simple to implement, no training data required, naturally handles edge cases. Disadvantages: low-confidence queries incur latency from multiple model calls before escalating.
FrugalGPT (Chen et al., Stanford, 2023) implements a three-model cascade and reports matching GPT-4's performance with up to 98% cost reduction on the datasets it evaluates. arXiv:2305.05176
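The cascade described above can be sketched in a few lines. This is a minimal illustration, not FrugalGPT's implementation: `Tier`, `cascade`, `cheap_model`, and `frontier_model` are hypothetical names, and the stand-in models fake their confidence from query length. In a real system each `call` would wrap a model client and derive confidence from logprobs or a verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A model call returns (answer, confidence in [0, 1]). How confidence is
# obtained (logprobs, a verifier model, ...) is left to the caller.
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class Tier:
    name: str
    call: ModelFn
    threshold: float  # accept the answer if confidence >= threshold


def cascade(query: str, tiers: List[Tier]) -> Tuple[str, str, int]:
    """Return (answer, tier_name, number_of_model_calls)."""
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    for calls, tier in enumerate(tiers, start=1):
        answer, confidence = tier.call(query)
        is_last = calls == len(tiers)
        # The last tier always answers, whatever its confidence.
        if confidence >= tier.threshold or is_last:
            return answer, tier.name, calls
    raise AssertionError("unreachable: the last tier always returns")


# Hypothetical stand-ins for real model clients.
def cheap_model(query: str) -> Tuple[str, float]:
    # Pretend short queries are easy and long ones are uncertain.
    return "cheap:" + query, 0.95 if len(query.split()) <= 5 else 0.40


def frontier_model(query: str) -> Tuple[str, float]:
    return "frontier:" + query, 0.99


TIERS = [Tier("cheap", cheap_model, 0.8), Tier("frontier", frontier_model, 0.0)]
```

Calling `cascade("reset my password", TIERS)` stops at the cheap tier after one call, while a long, ambiguous query falls through and costs two calls. That second case is the latency penalty noted in the disadvantages.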
Pre-Inference Classification
Train a lightweight classifier on samples of your production queries, labeled with which model produced acceptable quality. At inference time, classify first — before any LLM call — and route directly to the predicted best model. No wasted inference on queries you'll escalate anyway.
Advantages: a single LLM call per query, with only the small overhead of the classifier itself. Disadvantages: requires labeled training data from your specific workload, and classifier accuracy is capped by the quality of that training data.
RouteLLM (Ong et al., 2024) trains routers on human preference data, including a matrix factorization model and a Bradley-Terry-based ranker, achieving 2× cost savings while maintaining 95% of GPT-4 quality. arXiv:2406.18665
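As a rough illustration of classifying before any LLM call, the sketch below trains a tiny Naive Bayes text classifier whose labels are model names. It is a deliberately simple stand-in, not the RouteLLM method. The class name `NaiveBayesRouter`, the labels `"small"` and `"frontier"`, and the eight training queries are all assumptions made for the example; a production router would train on far more labeled production traffic.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesRouter:
    """Multinomial Naive Bayes over words; labels are model names."""

    def __init__(self, examples: List[Tuple[str, str]]):
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def route(self, query: str) -> str:
        tokens = tokenize(query)
        total = sum(self.label_counts.values())
        best_label, best_score = "", -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n / total)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled sample: each query is tagged with the cheapest
# model that produced acceptable output for it.
TRAINING = [
    ("classify this ticket as billing or technical", "small"),
    ("extract the invoice date and total amount", "small"),
    ("what is the refund policy", "small"),
    ("classify the sentiment of this review", "small"),
    ("analyze the tradeoffs between these two architectures and recommend one", "frontier"),
    ("explain why these contract clauses conflict and propose a resolution", "frontier"),
    ("draft a detailed migration strategy weighing risk and cost", "frontier"),
    ("reason step by step about the root cause of this outage", "frontier"),
]
router = NaiveBayesRouter(TRAINING)
```

Because `route` runs before any model is called, each query costs exactly one LLM call. The tradeoff is that a misrouted query gets no second chance unless a fallback is added.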
Embedding-Based Routing
Embed each query and compare it against a library of historical queries with known optimal model assignments. Route to the model that worked best for semantically similar past queries. Improves automatically as you accumulate production data.
Advantages: improves over time, captures subtle task-specific patterns, no manual labeling needed beyond initial setup. Disadvantages: higher latency from embedding computation; cold-start problem with new query types.
This approach is most effective in deployments where query types are diverse but overlap with historical patterns. It works best when combined with a fallback cascade for novel query types.
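A minimal sketch of the idea, under loud assumptions: `embed` here is a toy bag-of-words vector standing in for a real embedding model, and `EmbeddingRouter`, `HISTORY`, `k`, and `min_similarity` are hypothetical names and values chosen for illustration. Queries that resemble nothing in the history are handed to a fallback label (here `"cascade"`), mirroring the fallback recommended above.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class EmbeddingRouter:
    def __init__(self, history: List[Tuple[str, str]], k: int = 3,
                 min_similarity: float = 0.3, fallback: str = "cascade"):
        # history holds (past_query, model_that_worked_best) pairs.
        self.index = [(embed(q), model) for q, model in history]
        self.k = k
        self.min_similarity = min_similarity
        self.fallback = fallback

    def route(self, query: str) -> str:
        q = embed(query)
        scored = sorted(
            ((cosine(q, vec), model) for vec, model in self.index), reverse=True
        )[: self.k]
        # Nothing similar enough in the history: treat as a novel query type.
        if not scored or scored[0][0] < self.min_similarity:
            return self.fallback
        # Similarity-weighted vote among the k nearest past queries.
        votes: Dict[str, float] = {}
        for sim, model in scored:
            votes[model] = votes.get(model, 0.0) + sim
        return max(votes, key=votes.get)


HISTORY = [
    ("what is your refund policy", "small"),
    ("what is the shipping policy for returns", "small"),
    ("extract the dates from this email", "small"),
    ("compare these two legal contracts and explain the risks", "frontier"),
    ("explain the risks of this merger across both contracts", "frontier"),
]
emb_router = EmbeddingRouter(HISTORY)
```

Appending each newly resolved query and its best model to the index is what lets this approach improve as production data accumulates.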
What Routing Can and Cannot Replace
Routing is not a universal solution. It works best when your query distribution is predictable and you have sufficient volume on each task type to measure quality empirically. Understanding its limits prevents misconfigured deployments.
Tasks where routing typically captures full savings:
- Classification with 3–20 predefined categories
- Structured data extraction from predictable document formats
- Summarization of short, well-structured documents
- JSON generation with a fixed schema
- Sentiment analysis and intent detection
- Lookup-style responses from a fixed knowledge base
Tasks where frontier quality is harder to replace:
- Multi-step reasoning across ambiguous or contradictory inputs
- Novel creative generation where quality judgments are subjective
- High-stakes decisions with significant downside risk (medical, legal)
- Long-form synthesis across many documents with nuanced judgment
The key question is not "is routing safe?" but "for which specific tasks on my actual workload does a smaller model match frontier quality?" This requires running evals on your data, not assuming benchmark performance translates.
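One way to make that question concrete is to score both models on the same labeled sample and compare pass rates per task type. The sketch below assumes you already have pass/fail judgments for each model; `routable_tasks`, the 2-point quality gap, the minimum sample size, and the records themselves are illustrative assumptions, not recommended values.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (task_type, small_model_passed, frontier_model_passed)
EvalRecord = Tuple[str, bool, bool]


def routable_tasks(records: Iterable[EvalRecord],
                   max_quality_gap: float = 0.02,
                   min_samples: int = 5) -> Dict[str, bool]:
    """Mark a task type routable if the small model's pass rate is within
    max_quality_gap of the frontier model's, given enough samples."""
    stats = defaultdict(lambda: [0, 0, 0])  # [n, small passes, frontier passes]
    for task, small_ok, frontier_ok in records:
        s = stats[task]
        s[0] += 1
        s[1] += int(small_ok)
        s[2] += int(frontier_ok)
    decisions = {}
    for task, (n, small, frontier) in stats.items():
        gap = (frontier - small) / n
        decisions[task] = n >= min_samples and gap <= max_quality_gap
    return decisions


# Hypothetical eval results, for illustration only.
RECORDS = (
    [("classification", True, True)] * 10
    + [("reasoning", True, True)] * 6
    + [("reasoning", False, True)] * 3
    + [("reasoning", False, False)]
    + [("extraction", True, True)] * 3
)
decisions = routable_tasks(RECORDS)
```

Task types with too few samples are kept on the frontier model by default, since there is not yet enough evidence to claim the smaller model matches it.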
How to Build Your First Router
The fastest path to production routing starts with the cascade approach — no training data required, and it immediately captures savings on your most predictable query types.
- Pick your two-model pair — a cheap model for commodity tasks (e.g., Llama-3-8B or Claude Haiku) and your current frontier model as fallback
- Define your quality signal: how do you know a response is "good enough"? Options include logprob confidence, a lightweight verifier model, human review on a sample, or a downstream metric (e.g., click rate or user satisfaction)
- Set your confidence threshold — run the cheap model on 100–500 representative queries, score outputs, and find the threshold where quality matches your frontier model
- Deploy the cascade — call cheap model first; if confidence exceeds threshold, return; otherwise call frontier model
- Monitor and calibrate — track escalation rate (% of queries hitting frontier model), quality signal distribution, and cost per query over time
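The threshold-setting step above can be sketched as a small search over a scored calibration set. This assumes each sample is a pair of the cheap model's confidence and a pass/fail judgment of its output; `calibrate_threshold`, the sample values, and the 0.95 quality target are hypothetical and chosen only to make the example concrete.

```python
from typing import List, Optional, Tuple


def calibrate_threshold(samples: List[Tuple[float, bool]],
                        target_quality: float) -> Optional[Tuple[float, float]]:
    """Return (threshold, escalation_rate) for the lowest threshold whose
    accepted answers meet target_quality, or None if no threshold does."""
    # Candidate thresholds are the observed confidences, lowest first, so the
    # first one that meets the target escalates the fewest queries.
    for threshold in sorted({conf for conf, _ in samples}):
        accepted = [ok for conf, ok in samples if conf >= threshold]
        quality = sum(accepted) / len(accepted)
        if quality >= target_quality:
            escalation_rate = 1 - len(accepted) / len(samples)
            return threshold, escalation_rate
    return None


# Hypothetical calibration set: (cheap-model confidence, output acceptable?)
SAMPLES = [
    (0.95, True), (0.92, True), (0.90, True), (0.88, True), (0.85, True),
    (0.80, True), (0.75, False), (0.70, True), (0.60, False), (0.40, False),
]
result = calibrate_threshold(SAMPLES, target_quality=0.95)
```

With only ten samples the estimate is noisy, which is why the step above suggests 100–500 representative queries before trusting the threshold.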
A well-calibrated cascade typically escalates 15–35% of queries to the frontier model, with the remainder handled by the cheaper alternative. An initial escalation rate above 50% usually indicates the threshold is too conservative; a rate below 5% often means it is too permissive.
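The arithmetic behind these escalation rates is worth making explicit, because in a cascade every query pays for the cheap call and escalated queries pay for both. The sketch below uses hypothetical per-query prices and function names; the rate bands in `escalation_health` simply restate the ranges from the paragraph above.

```python
def cascade_cost_per_query(cheap_cost: float, frontier_cost: float,
                           escalation_rate: float) -> float:
    # Every query pays for the cheap call; escalated ones also pay for frontier.
    return cheap_cost + escalation_rate * frontier_cost


def savings_vs_frontier_only(cheap_cost: float, frontier_cost: float,
                             escalation_rate: float) -> float:
    blended = cascade_cost_per_query(cheap_cost, frontier_cost, escalation_rate)
    return 1 - blended / frontier_cost


def escalation_health(escalation_rate: float) -> str:
    if escalation_rate > 0.50:
        return "too_conservative"
    if escalation_rate < 0.05:
        return "too_permissive"
    if 0.15 <= escalation_rate <= 0.35:
        return "typical"
    return "borderline"


# Hypothetical per-query prices in dollars, for illustration only.
CHEAP, FRONTIER = 0.001, 0.02
blended = cascade_cost_per_query(CHEAP, FRONTIER, escalation_rate=0.25)
savings = savings_vs_frontier_only(CHEAP, FRONTIER, escalation_rate=0.25)
```

Under these assumed prices, a 25% escalation rate gives a blended cost of about $0.006 per query, roughly 70% below sending everything to the frontier model. The same formula shows why savings shrink quickly as the escalation rate rises.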
"The hardest part of model routing is not the engineering — it's defining what 'good enough' means for each task type on your specific production data. Once you have that, the routing logic is straightforward."
Frequently Asked Questions
What is LLM model routing?
LLM model routing automatically directs each incoming query to the cheapest model capable of answering it correctly. Instead of using one frontier model for everything, a lightweight router classifies the difficulty of each query and sends simple requests to smaller, cheaper models while escalating complex ones to frontier models.
How much cost can LLM model routing save?
Research from RouteLLM (Ong et al., 2024) shows 2× cost reduction while maintaining 95% of GPT-4 quality. FrugalGPT (Chen et al., 2023) demonstrates up to 98% cost reduction with equal or better accuracy on specific benchmarks. In production, most enterprise teams see 40–70% savings after implementing routing.
Does model routing reduce quality?
When calibrated correctly, model routing does not reduce output quality on your specific workload. The key is evaluating on your own query distribution, not benchmark averages. Tasks where smaller models match frontier quality include: classification, entity extraction, structured JSON generation, summarization of short documents, and FAQ responses.
What is the difference between LLM routing and model cascading?
Cascading tries cheap models first and escalates based on confidence — it always starts with the cheapest option. Routing classifies difficulty before any inference and sends the query directly to the right model — no wasted calls. Cascading is easier to implement; routing is more efficient for predictable query types.