How much cost does prompt compression save?

Prompt compression saves on input token costs directly, which can be significant for workloads with long system prompts, RAG context, or few-shot examples. LLMLingua achieves 2–20× compression with under 2% quality degradation. At 4× average compression on input tokens, a workload spending $10,000/month on inputs would save $7,500/month with no model switching required.

What types of prompts compress well?

Prompts that compress well include: RAG-retrieved context (often 60–80% redundant relative to the query), long system prompts with repetitive instructions, few-shot examples with verbose formatting, and document summarization inputs. Prompts that compress poorly include: short queries (nothing to remove), code with precise syntax, and mathematical problems where every token matters.

Prompt Compression: Cut LLM Token Costs 2

Q: What is prompt compression?

Prompt compression is a technique that removes redundant, low-information tokens from LLM inputs before sending them to the model. By stripping filler words, repeated context, and low-importance tokens, it reduces the token count of inputs by 2–20× while preserving the information the model needs to produce a correct response.

Q: How does LLMLingua work?

LLMLingua uses a small language model (typically Llama or Falcon) to score the perplexity of each token in the prompt. Low-perplexity tokens — those that are predictable and redundant — are dropped. High-perplexity tokens — those that carry essential information — are preserved. The result is a compressed prompt that retains the key information at a fraction of the token count.

Q: Does prompt compression work with all LLMs?

Yes — prompt compression is model-agnostic. It runs before the prompt reaches the target LLM, so it works with GPT-4, Claude, Gemini, or any other model. The compressor is a separate small model (used only for token scoring), not the same model handling inference.

Q: What is the difference between prompt compression and prompt caching?

Prompt caching (available in Claude and GPT-4o) stores a processed version of a repeated prefix to avoid re-processing it on each call — it reduces compute cost on repeated context. Prompt compression reduces the token count of the prompt itself before it reaches the model — it reduces both input token cost and the compute needed to process the context. They are complementary: compress first, then cache the compressed prefix.

Every token in your LLM prompt costs money. Most of them don't need to be there. Prompt compression removes redundant tokens from inputs before they reach the model — cutting costs on input tokens without switching models, changing output quality, or touching your application logic.

The technique works because natural language is deeply redundant. A RAG retrieval that returns five 500-word passages to answer a simple question is sending thousands of tokens the model doesn't need. A system prompt written conversationally contains filler, repetition, and padding that carries no information. Prompt compression identifies which tokens are doing work and drops the rest. It is one of the five core techniques in our guide to LLM cost optimization.

LeanLM (not affiliated with Google’s LearnLM educational AI) is an LLM cost optimization platform — this post covers prompt compression techniques for production workloads.

Research from Microsoft shows compression ratios of 2–20× with under 2% quality degradation on benchmarks including CoQA, HotpotQA, and TriviaQA (Jiang et al., LLMLingua, 2023). At 5× compression on a workload spending $20,000/month on input tokens, that's $16,000 saved monthly — with no model changes.

Definition

Prompt compression is the removal of redundant, low-information tokens from LLM inputs before inference. A compression model scores each token's informativeness and drops those below a threshold — preserving the tokens that carry essential meaning while reducing overall token count. The compressed prompt produces equivalent output from the target LLM at a fraction of the input cost.

Prompt compression before-and-after diagram: an original prompt of 847 tokens costing $0.013 per call is reduced by LLMLingua to 94 tokens costing $0.0014 per call — a 9× compression with under 2% quality degradation — LLMLingua cuts this prompt from 847 to 94 tokens — a 9× reduction that drops input cost by the same factor with under 2% quality loss.

Why Prompts Are Redundant

Most production LLM inputs are longer than they need to be. This isn't a design flaw — it's the natural consequence of how LLM applications are built:

RAG retrieval over-fetches context. Retrieval systems return top-k passages to ensure coverage. But retrieved passages are rarely 100% relevant to the query — they're selected by semantic similarity, not information density. A typical RAG retrieval for a focused question returns 2,000–5,000 tokens of context where 400–800 tokens contain the actual answer.

System prompts are written like documentation. Engineers write system prompts for clarity and correctness, not token efficiency. Explanations, caveats, examples, and formatting guidance are all valuable for prompt engineering — and all more verbose than the minimum required by the model.

Few-shot examples are padded. Few-shot examples are chosen to be representative and readable. Representative examples include common phrasing patterns, which means they contain natural language redundancy by design.

Conversation history accumulates. Multi-turn applications send the entire conversation history with each request. Early turns in a conversation are often less relevant to the current question than recent turns — but they're included anyway for context completeness.

20×

maximum compression ratio demonstrated by LLMLingua-2 on RAG workloads with under 2% quality degradation

How Prompt Compression Works

The dominant approach, pioneered by LLMLingua, uses a small auxiliary language model to score the informativeness of each token in the prompt, then drops the least informative ones.

Token-level

Perplexity-Based Token Dropping (LLMLingua)

A small model (Llama-2-7B or similar) computes the perplexity of each token — how surprising it is given its context. Tokens with low perplexity (predictable, redundant) are dropped. Tokens with high perplexity (unexpected, information-dense) are preserved. Compression ratio is controlled by a budget parameter.

LLMLingua (Jiang et al., Microsoft, 2023): 2–5× compression on instruction prompts; 5–20× on RAG context. Under 2% quality drop on CoQA, HotpotQA, TriviaQA. arxiv:2310.05736 ↗

Chunk-level

Coarse-to-Fine Compression (LongLLMLingua)

First scores and filters document-level chunks (paragraphs, passages) before doing token-level compression within the selected chunks. Reduces computation cost of the compression step itself, making it practical for very long contexts (100k+ tokens).

LongLLMLingua (Jiang et al., Microsoft, 2023): extends LLMLingua to long-document RAG, maintaining quality on NaturalQuestions and MuSiQue benchmarks at 4× compression. arxiv:2310.06839 ↗

Task-aware

Question-Aware Compression (LLMLingua-2)

Incorporates the specific question or instruction when scoring token importance — tokens that are important for the question but not in isolation are preserved. Trained via extractive compression on GPT-4-generated training data, achieving better task alignment than perplexity-only scoring.

LLMLingua-2 (Pan et al., Microsoft, 2024): outperforms LLMLingua across all benchmarks at equivalent compression ratios; faster inference due to smaller compressor model. arxiv:2403.12968 ↗

What Compresses Well vs. What Doesn't

Prompt compression is not universally applicable. The savings depend heavily on your prompt structure and task type. Understanding where it works prevents misapplication.

High-compression candidates (5–20× possible):

RAG-retrieved document context for question answering
Long system prompts with verbose instructions
Conversation history where early turns are less relevant
Narrative documents for summarization
Few-shot examples with repetitive structure

Low-compression candidates (2× or less, proceed carefully):

Code and structured data — syntax-dependent, every token matters
Mathematical problems — precision required, no redundancy
Short prompts under 200 tokens — minimal redundancy to remove
Highly templated prompts already written for token efficiency

The safest starting point is RAG context. Retrievals are reliably redundant relative to the query, compression ratios are highest, and the quality impact is lowest because you're removing context the model wasn't using anyway.

Research Foundations

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Jiang et al. — Microsoft — 2023 — arxiv:2310.05736

20×

max compression

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Pan et al. — Microsoft — 2024 — arxiv:2403.12968

<2%

quality drop

Selective Context: Filtering Irrelevant Context for Long-Document LLMs

Li et al. — 2023 — arxiv:2310.06201

50%

token reduction

Deploying Prompt Compression in Production

Adding prompt compression to an existing LLM application is an infrastructure change, not an application change. The compressor sits between your application and the LLM API — it receives the full prompt and returns a compressed version before forwarding.

Identify your compression targets — instrument your LLM calls to measure average prompt length by component: system prompt, retrieved context, conversation history, user query. The longest components with the most structural redundancy are your first targets.
Set compression budget — start conservative (2–3× compression ratio on context). Run compressed vs. uncompressed on a sample of your production queries and measure quality on your actual output metrics, not just benchmark proxies.
Compress selectively — compress RAG context and conversation history aggressively; compress system prompts conservatively; don't compress the user's actual query.
Monitor quality per task type — different task types have different sensitivity to compression. Track quality metrics (downstream task accuracy, user ratings, etc.) broken down by task type, not just in aggregate.
Combine with caching — compress the system prompt once, then cache the compressed version. Compress RAG context per query. You get the compression savings on every call and the caching savings on repeated prefixes.

Once compression is in place, layer in LLM model routing to send the compressed inputs to the cheapest capable model — and consider LoRA fine-tuning for high-volume task types where a smaller adapted model can replace the API call entirely.

"Prompt compression is the only LLM cost optimization that requires no model changes, no routing infrastructure, and no quality tradeoff calibration — you just make the input smaller. The savings are immediate and the risk is low."

Frequently Asked Questions

What is prompt compression?

Prompt compression removes redundant, low-information tokens from LLM inputs before sending them to the model. By stripping filler words, repeated context, and low-importance tokens, it reduces the token count of inputs by 2–20× while preserving the information the model needs to produce a correct response.

How does LLMLingua work?

LLMLingua uses a small language model (typically Llama or Falcon) to score the perplexity of each token in the prompt. Low-perplexity tokens — those that are predictable and redundant — are dropped. High-perplexity tokens — those that carry essential information — are preserved. The result is a compressed prompt that retains the key information at a fraction of the token count.

Does prompt compression work with all LLMs?

Yes — prompt compression is model-agnostic. It runs before the prompt reaches the target LLM, so it works with GPT-4, Claude, Gemini, or any other model. The compressor is a separate small model used only for token scoring, not the same model handling inference.

What is the difference between prompt compression and prompt caching?

Prompt caching (available in Claude and GPT-4o) stores a processed version of a repeated prefix to avoid re-processing it on each call — it reduces compute cost on repeated context. Prompt compression reduces the token count of the prompt itself before it reaches the model. They are complementary: compress first, then cache the compressed prefix.

Prompt Compression: Reduce LLM Token Costs by 2–20× Without Losing Quality