The fastest way to cut your LLM bill in half is to stop routing every query through your most expensive model. Most enterprise AI pipelines use a single frontier model for everything (classification, extraction, summarization, generation), even though 60–80% of those tasks could be handled by a model costing 10–100× less.

LLM model routing is the technique that fixes this. A lightweight router inspects each incoming query, estimates its complexity, and sends it to the cheapest model capable of answering correctly. Complex or high-stakes queries go to frontier models. Routine tasks go to smaller, cheaper alternatives. With a well-calibrated router, quality holds steady while costs drop.

Stanford's FrugalGPT work (Chen et al., 2023) reports that a cascade of cheaper models can match GPT-4's accuracy at up to 98% lower cost on the datasets it evaluates. Model routing is the most direct way to capture savings of that kind.

Definition

LLM model routing is the practice of automatically directing each query to the cheapest model capable of answering it correctly. A router classifies each request by difficulty or task type — before or during inference — and maps it to the optimal model from a portfolio. The goal is to pay only for the intelligence each task actually requires.

Figure: LLM model routing architecture. An incoming query passes through a routing classifier, which directs simple queries to a small model at $0.40 per million tokens and complex queries to a frontier model at $15 per million tokens. Sending each query to the cheapest capable model saves 40–70% on typical enterprise pipelines.
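As a rough worked example of the arithmetic behind the figure, the sketch below computes the blended price per million tokens from the two prices quoted above ($0.40 and $15). The share of tokens sent to the small model is an assumed input rather than a measured value, the function names are illustrative, and the cost of running the router itself is ignored.

```python
SMALL_PRICE = 0.40      # USD per million tokens, from the figure
FRONTIER_PRICE = 15.00  # USD per million tokens, from the figure


def blended_cost(small_share: float) -> float:
    """Price per million tokens when `small_share` of tokens go to the small model."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return small_share * SMALL_PRICE + (1.0 - small_share) * FRONTIER_PRICE


def savings_fraction(small_share: float) -> float:
    """Fractional saving relative to sending every token to the frontier model."""
    return 1.0 - blended_cost(small_share) / FRONTIER_PRICE


if __name__ == "__main__":
    # Assumed traffic splits, chosen only to illustrate the calculation.
    for share in (0.5, 0.7):
        print(f"{share:.0%} to small model: "
              f"${blended_cost(share):.2f} per million tokens, "
              f"{savings_fraction(share):.0%} saved")
```

Under these assumptions, sending half of the tokens to the small model already saves close to half of the bill, because the small model's price is nearly negligible next to the frontier price.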

Why Routing Works: The Capability Gap Is Real

Frontier models are overqualified for most production tasks. Not every query needs the reasoning depth of GPT-4o or Claude Opus — and for tasks where a smaller model performs identically, using a frontier model is pure waste.

Consider what a typical enterprise AI pipeline actually processes:

The first five tasks are commodity inference. A 7B or 13B fine-tuned model handles them at equal quality on most workloads. Only the last — complex reasoning and free-form generation where quality variance matters — benefits from frontier intelligence. Routing identifies which category each query falls into and prices accordingly.

2–4× cost reduction reported by RouteLLM while maintaining 95% of GPT-4 quality on its benchmark evaluations

Three Routing Architectures

There are three proven approaches to LLM routing. Each makes different tradeoffs between implementation complexity, latency, and accuracy.

Cascade

Confidence-Based Cascading

Try the cheapest model first. If its response confidence is above a threshold, return it. If not, escalate to a more capable model. Repeat up the capability ladder until confidence is met or the frontier model responds.

Advantages: simple to implement, no training data required, naturally handles edge cases. Disadvantages: low-confidence queries incur latency from multiple model calls before escalating.

FrugalGPT (Chen et al., Stanford, 2023) learns an LLM cascade and reports up to 98% cost reduction while matching GPT-4's performance on the datasets it evaluates. arxiv:2305.05176
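The cascade described above can be sketched in a few lines. This is a minimal illustration, not FrugalGPT's implementation: the `Tier` type, the `(answer, confidence)` return convention, and the stub models with their fake confidence values are all assumptions made so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: a model call returns (answer, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class Tier:
    name: str
    call: ModelFn
    threshold: float  # minimum confidence required to accept this tier's answer


def cascade(query: str, tiers: List[Tier]) -> Tuple[str, str, int]:
    """Try tiers cheapest-first; return (answer, tier_name, calls_made)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for calls_made, tier in enumerate(tiers, start=1):
        answer, confidence = tier.call(query)
        is_last_tier = calls_made == len(tiers)
        # Accept a confident answer, or whatever the top tier returns.
        if confidence >= tier.threshold or is_last_tier:
            return answer, tier.name, calls_made
    raise AssertionError("unreachable: the last tier always returns")


# Stub models so the sketch runs offline; real model calls would go here.
def small_model(query: str) -> Tuple[str, float]:
    confidence = 0.9 if len(query.split()) <= 6 else 0.3  # fake confidence signal
    return f"small answer to: {query}", confidence


def frontier_model(query: str) -> Tuple[str, float]:
    return f"frontier answer to: {query}", 0.99


TIERS = [
    Tier("small", small_model, threshold=0.8),
    Tier("frontier", frontier_model, threshold=0.0),
]

if __name__ == "__main__":
    for q in ("reset my password",
              "compare these two indemnity clauses and explain which one is riskier for us"):
        _, tier_name, calls = cascade(q, TIERS)
        print(f"{tier_name} answered after {calls} call(s)")
```

The `calls_made` value makes the latency tradeoff visible: every escalated query pays for each cheaper tier it passed through before reaching the model that answered it.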

Classifier

Pre-Inference Classification

Train a lightweight classifier on samples of your production queries, labeled with which model produced acceptable quality. At inference time, classify first — before any LLM call — and route directly to the predicted best model. No wasted inference on queries you'll escalate anyway.

Advantages: single LLM call per query, no escalation latency. Disadvantages: requires labeled training data from your specific workload; classifier accuracy is capped by the quality of that training data.

RouteLLM (Ong et al., 2024) trains several routers on human preference data, including a matrix factorization model and a similarity-weighted ranking based on the Bradley-Terry model, and reports over 2× cost savings while maintaining 95% of GPT-4 quality. arxiv:2406.18665
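To make pre-inference classification concrete, here is a minimal sketch that swaps in a tiny naive Bayes text classifier written in plain Python. It is not the RouteLLM method; the class name, the whitespace tokenizer, the labels `"small"` and `"frontier"`, and the seven training queries are all hypothetical stand-ins for a labeled sample of real production traffic.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class NaiveBayesRouter:
    """Multinomial naive Bayes with add-one smoothing over query tokens."""

    def __init__(self) -> None:
        self.label_counts: Counter = Counter()
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def fit(self, examples: Iterable[Tuple[str, str]]) -> "NaiveBayesRouter":
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def route(self, query: str) -> str:
        """Return the label of the model predicted to handle `query`."""
        if not self.label_counts:
            raise RuntimeError("router has not been fitted")
        total = sum(self.label_counts.values())
        best_label, best_score = "", -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokenize(query):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labels: which model produced acceptable quality for each query.
TRAINING = [
    ("classify this ticket as billing or technical", "small"),
    ("extract the invoice number and date", "small"),
    ("extract names from this email", "small"),
    ("classify the sentiment of this review", "small"),
    ("explain the legal risk in this contract clause", "frontier"),
    ("draft a detailed migration strategy and explain tradeoffs", "frontier"),
    ("analyze why revenue dropped and explain causes", "frontier"),
]

if __name__ == "__main__":
    router = NaiveBayesRouter().fit(TRAINING)
    for q in ("extract the date from this ticket",
              "explain the tradeoffs in this strategy"):
        print(q, "->", router.route(q))
```

The routing decision happens before any LLM call, so each query costs exactly one model invocation. In practice a stronger classifier and far more labeled data would be needed; the point here is only the shape of the pipeline.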

Semantic

Embedding-Based Routing

Embed each query and compare it against a library of historical queries with known optimal model assignments. Route to the model that worked best for semantically similar past queries. Improves automatically as you accumulate production data.

Advantages: improves over time, captures subtle task-specific patterns, no manual labeling needed beyond initial setup. Disadvantages: higher latency from embedding computation; cold-start problem with new query types.

This approach has proven effective in production deployments where query types are diverse but overlap with historical patterns. It works best when combined with a fallback cascade for novel query types.
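The nearest-neighbor lookup with a cascade fallback can be sketched as follows. To stay self-contained, the example uses a bag-of-words count vector as a toy stand-in for a real embedding model. The history entries, the `k` and `min_similarity` values, and the `"cascade"` fallback label are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: token counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticRouter:
    def __init__(self, history: Sequence[Tuple[str, str]], k: int = 3,
                 min_similarity: float = 0.3, fallback: str = "cascade") -> None:
        # Each entry pairs a past query's embedding with the model that served it well.
        self.index: List[Tuple[Counter, str]] = [(embed(q), m) for q, m in history]
        self.k = k
        self.min_similarity = min_similarity
        self.fallback = fallback

    def route(self, query: str) -> str:
        qv = embed(query)
        neighbors = [(cosine(qv, vec), model) for vec, model in self.index]
        # Keep only neighbors that are similar enough, then take the top k.
        close = sorted((n for n in neighbors if n[0] >= self.min_similarity),
                       key=lambda n: n[0], reverse=True)[: self.k]
        if not close:
            return self.fallback  # novel query type: defer to the cascade
        votes = Counter(model for _, model in close)
        return votes.most_common(1)[0][0]


# Hypothetical history of past queries and the model that handled them well.
HISTORY = [
    ("what are your opening hours", "small"),
    ("what is your refund policy", "small"),
    ("how do i reset my password", "small"),
    ("compare these two contracts and explain the legal risk", "frontier"),
    ("analyze this incident report and explain the root cause", "frontier"),
]

if __name__ == "__main__":
    router = SemanticRouter(HISTORY)
    for q in ("what is your return policy",
              "explain the legal risk in this contract",
              "translate this poem into french"):
        print(q, "->", router.route(q))
```

The `min_similarity` cutoff is what addresses the cold-start problem: a query with no sufficiently similar history is handed to the fallback instead of being routed on weak evidence.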

What Routing Can and Cannot Replace

Routing is not a universal solution. It works best when your query distribution is predictable and you have sufficient volume on each task type to measure quality empirically. Understanding its limits prevents misconfigured deployments.

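Tasks where routing typically captures full savings: classification, entity extraction, structured JSON generation, summarization of short documents, and FAQ responses.

Tasks where frontier quality is harder to replace: complex reasoning and free-form generation where quality variance matters.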

The key question is not "is routing safe?" but "for which specific tasks on my actual workload does a smaller model match frontier quality?" This requires running evals on your data, not assuming benchmark performance translates.
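One way to answer that question is to score the smaller model per task type on a sample of your own traffic. The sketch below assumes each eval record is a `(task_type, small_model_ok)` pair, where the boolean says whether the smaller model's output was judged to match frontier quality. The 95% match rate and 50-sample minimum are illustrative defaults, not recommendations from the research cited here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def routable_tasks(results: Iterable[Tuple[str, bool]],
                   min_match_rate: float = 0.95,
                   min_samples: int = 50) -> Dict[str, float]:
    """Return {task_type: match_rate} for task types safe to send to the small model.

    A task type qualifies only if it has enough samples to be measured and the
    small model matched frontier quality often enough on them.
    """
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for task_type, small_model_ok in results:
        totals[task_type] += 1
        matches[task_type] += int(small_model_ok)
    return {
        task: matches[task] / n
        for task, n in totals.items()
        if n >= min_samples and matches[task] / n >= min_match_rate
    }


if __name__ == "__main__":
    # Synthetic eval results, for illustration only.
    evals = (
        [("classification", True)] * 98 + [("classification", False)] * 2
        + [("reasoning", True)] * 70 + [("reasoning", False)] * 30
        + [("extraction", True)] * 10  # too few samples to trust
    )
    print(routable_tasks(evals))
```

Task types that fail either check stay on the frontier model until more data is collected or the smaller model improves.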

Production Case Studies

Checkr
Background check document extraction and classification
Implemented a two-tier routing system: a fine-tuned 7B model handles document classification and field extraction (85% of volume); GPT-4o handles ambiguous documents, appeals, and compliance edge cases (15% of volume). Quality was measured by human review of a sample of outputs, with no regression detected. Methodology: FrugalGPT.
Result: 62% cost reduction

Enterprise SaaS (disclosed via MLOps community)
Customer support ticket classification and response drafting
Three-tier cascade: Llama-3-8B for intent classification and simple FAQ responses; Claude Haiku for structured drafts; Claude Sonnet for complex escalations. Tier assignment is based on confidence thresholds and ticket tag vocabulary. CSAT was unchanged after an 8-week monitoring period.
Result: 74% cost reduction

Research Foundations

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
Chen et al., Stanford University, 2023, arxiv:2305.05176
Result: 98% max cost reduction

RouteLLM: Learning to Route LLMs with Preference Data
Ong et al., 2024, arxiv:2406.18665
Result: 2× lower cost at 95% quality

Not All Tokens Are What You Need (Efficient LLM Selection)
Zeng et al., 2024, arxiv:2403.19651
Result: 3.4× cost reduction

How to Build Your First Router

The fastest path to production routing starts with the cascade approach — no training data required, and it immediately captures savings on your most predictable query types.

  1. Pick your two-model pair: a cheap model for commodity tasks (e.g., Llama-3-8B or Claude Haiku) and your current frontier model as fallback
  2. Define your quality signal: how do you know a response is "good enough"? Options: logprob confidence, a lightweight verifier model, human review on a sample, or a downstream metric (e.g., click rate, user satisfaction)
  3. Set your confidence threshold: run the cheap model on 100–500 representative queries, score the outputs, and find the threshold above which its quality matches your frontier model
  4. Deploy the cascade: call the cheap model first; if confidence exceeds the threshold, return its answer; otherwise call the frontier model
  5. Monitor and calibrate: track escalation rate (% of queries hitting the frontier model), quality signal distribution, and cost per query over time
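Step 3 above can be sketched as a small search over candidate thresholds. The example assumes you already have, for each representative query, the cheap model's confidence and a boolean judgment of whether its answer was acceptable. The sample values and the 0.9 quality target are made up for illustration, and a real calibration set would use the 100–500 queries suggested in the list.

```python
from typing import Optional, Sequence, Tuple


def calibrate_threshold(samples: Sequence[Tuple[float, bool]],
                        target_quality: float) -> Optional[Tuple[float, float]]:
    """Find the lowest confidence threshold whose accepted answers meet the target.

    `samples` holds (confidence, is_acceptable) for cheap-model outputs.
    Returns (threshold, escalation_rate), or None if no threshold qualifies.
    """
    if not samples:
        raise ValueError("need at least one scored sample")
    for threshold in sorted({conf for conf, _ in samples}):
        # Answers at or above the threshold would be returned without escalation.
        accepted = [ok for conf, ok in samples if conf >= threshold]
        quality = sum(accepted) / len(accepted)
        if quality >= target_quality:
            escalation_rate = 1.0 - len(accepted) / len(samples)
            return threshold, escalation_rate
    return None


# Made-up scored outputs: (cheap-model confidence, judged acceptable?)
SAMPLES = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]

if __name__ == "__main__":
    result = calibrate_threshold(SAMPLES, target_quality=0.9)
    if result is None:
        print("no threshold meets the target; keep using the frontier model")
    else:
        threshold, escalation_rate = result
        print(f"threshold={threshold}, escalation rate={escalation_rate:.0%}")
```

Choosing the lowest qualifying threshold maximizes the share of traffic the cheap model keeps while still meeting the quality target on the calibration set.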

A well-calibrated cascade typically escalates 15–35% of queries to the frontier model, with the remainder handled by the cheaper alternative. A starting escalation rate above 50% usually indicates the threshold is too conservative; a rate below 5% often means it is too permissive.
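These rules of thumb are easy to turn into a monitoring check. The sketch below tracks escalations over a rolling window and applies the bounds from the paragraph above. The class name, window size, and message strings are illustrative assumptions.

```python
from collections import deque
from typing import Deque


class EscalationMonitor:
    """Tracks the share of recent queries that were escalated to the frontier model."""

    def __init__(self, window: int = 1000) -> None:
        # Only the most recent `window` routing decisions are kept.
        self.events: Deque[bool] = deque(maxlen=window)

    def record(self, escalated: bool) -> None:
        self.events.append(escalated)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def diagnosis(self) -> str:
        if not self.events:
            return "no data"
        rate = self.rate()
        if rate > 0.50:
            return "threshold may be too conservative"
        if rate < 0.05:
            return "threshold may be too permissive"
        if 0.15 <= rate <= 0.35:
            return "within typical range"
        return "outside typical range; review the quality signal"


if __name__ == "__main__":
    monitor = EscalationMonitor(window=100)
    for i in range(100):
        monitor.record(i % 4 == 0)  # simulate one escalation in every four queries
    print(f"{monitor.rate():.0%}: {monitor.diagnosis()}")
```

A low escalation rate is only a warning sign, not proof of a problem, so it is worth pairing this check with the quality signal from step 2 before adjusting the threshold.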

"The hardest part of model routing is not the engineering — it's defining what 'good enough' means for each task type on your specific production data. Once you have that, the routing logic is straightforward."

Frequently Asked Questions

What is LLM model routing?

LLM model routing automatically directs each incoming query to the cheapest model capable of answering it correctly. Instead of using one frontier model for everything, a lightweight router classifies the difficulty of each query and sends simple requests to smaller, cheaper models while escalating complex ones to frontier models.

How much cost can LLM model routing save?

Research from RouteLLM (Ong et al., 2024) shows 2× cost reduction while maintaining 95% of GPT-4 quality. FrugalGPT (Chen et al., 2023) demonstrates up to 98% cost reduction with equal or better accuracy on specific benchmarks. In production, most enterprise teams see 40–70% savings after implementing routing.

Does model routing reduce quality?

When calibrated correctly, model routing does not reduce output quality on your specific workload. The key is evaluating on your own query distribution, not benchmark averages. Tasks where smaller models match frontier quality include: classification, entity extraction, structured JSON generation, summarization of short documents, and FAQ responses.

What is the difference between LLM routing and model cascading?

Cascading tries cheap models first and escalates based on confidence — it always starts with the cheapest option. Routing classifies difficulty before any inference and sends the query directly to the right model — no wasted calls. Cascading is easier to implement; routing is more efficient for predictable query types.