Every token in your LLM prompt costs money. Most of them don't need to be there. Prompt compression removes redundant tokens from inputs before they reach the model — cutting costs on input tokens without switching models, changing output quality, or touching your application logic.
The technique works because natural language is deeply redundant. A RAG retrieval that returns five 500-word passages to answer a simple question is sending thousands of tokens the model doesn't need. A system prompt written conversationally contains filler, repetition, and padding that carries no information. Prompt compression identifies which tokens are doing work and drops the rest.
Research from Microsoft shows compression ratios of 2–20× with under 2% quality degradation on benchmarks including CoQA, HotpotQA, and TriviaQA (Jiang et al., LLMLingua, 2023). At 5× compression on a workload spending $20,000/month on input tokens, that's $16,000 saved monthly — with no model changes.
Definition
Prompt compression is the removal of redundant, low-information tokens from LLM inputs before inference. A compression model scores each token's informativeness and drops those below a threshold — preserving the tokens that carry essential meaning while reducing overall token count. The compressed prompt produces equivalent output from the target LLM at a fraction of the input cost.
Why Prompts Are Redundant
Most production LLM inputs are longer than they need to be. This isn't a design flaw — it's the natural consequence of how LLM applications are built:
RAG retrieval over-fetches context. Retrieval systems return top-k passages to ensure coverage. But retrieved passages are rarely 100% relevant to the query — they're selected by semantic similarity, not information density. A typical RAG retrieval for a focused question returns 2,000–5,000 tokens of context where 400–800 tokens contain the actual answer.
System prompts are written like documentation. Engineers write system prompts for clarity and correctness, not token efficiency. Explanations, caveats, examples, and formatting guidance are all valuable for prompt engineering — and all more verbose than the minimum required by the model.
Few-shot examples are padded. Few-shot examples are chosen to be representative and readable. Representative examples include common phrasing patterns, which means they contain natural language redundancy by design.
Conversation history accumulates. Multi-turn applications send the entire conversation history with each request. Early turns in a conversation are often less relevant to the current question than recent turns — but they're included anyway for context completeness.
How Prompt Compression Works
The dominant approach, pioneered by LLMLingua, uses a small auxiliary language model to score the informativeness of each token in the prompt, then drops the least informative ones.
Perplexity-Based Token Dropping (LLMLingua)
A small model (Llama-2-7B or similar) computes the perplexity of each token — how surprising it is given its context. Tokens with low perplexity (predictable, redundant) are dropped. Tokens with high perplexity (unexpected, information-dense) are preserved. Compression ratio is controlled by a budget parameter.
LLMLingua (Jiang et al., Microsoft, 2023): 2–5× compression on instruction prompts; 5–20× on RAG context. Under 2% quality drop on CoQA, HotpotQA, TriviaQA. arxiv:2310.05736 ↗
Coarse-to-Fine Compression (LongLLMLingua)
First scores and filters document-level chunks (paragraphs, passages) before doing token-level compression within the selected chunks. Reduces computation cost of the compression step itself, making it practical for very long contexts (100k+ tokens).
LongLLMLingua (Jiang et al., Microsoft, 2023): extends LLMLingua to long-document RAG, maintaining quality on NaturalQuestions and MuSiQue benchmarks at 4× compression. arxiv:2310.06839 ↗
Question-Aware Compression (LLMLingua-2)
Incorporates the specific question or instruction when scoring token importance — tokens that are important for the question but not in isolation are preserved. Trained via extractive compression on GPT-4-generated training data, achieving better task alignment than perplexity-only scoring.
LLMLingua-2 (Pan et al., Microsoft, 2024): outperforms LLMLingua across all benchmarks at equivalent compression ratios; faster inference due to smaller compressor model. arxiv:2403.12968 ↗
What Compresses Well vs. What Doesn't
Prompt compression is not universally applicable. The savings depend heavily on your prompt structure and task type. Understanding where it works prevents misapplication.
High-compression candidates (5–20× possible):
- RAG-retrieved document context for question answering
- Long system prompts with verbose instructions
- Conversation history where early turns are less relevant
- Narrative documents for summarization
- Few-shot examples with repetitive structure
Low-compression candidates (2× or less, proceed carefully):
- Code and structured data — syntax-dependent, every token matters
- Mathematical problems — precision required, no redundancy
- Short prompts under 200 tokens — minimal redundancy to remove
- Highly templated prompts already written for token efficiency
The safest starting point is RAG context. Retrievals are reliably redundant relative to the query, compression ratios are highest, and the quality impact is lowest because you're removing context the model wasn't using anyway.
Research Foundations
Deploying Prompt Compression in Production
Adding prompt compression to an existing LLM application is an infrastructure change, not an application change. The compressor sits between your application and the LLM API — it receives the full prompt and returns a compressed version before forwarding.
- Identify your compression targets — instrument your LLM calls to measure average prompt length by component: system prompt, retrieved context, conversation history, user query. The longest components with the most structural redundancy are your first targets.
- Set compression budget — start conservative (2–3× compression ratio on context). Run compressed vs. uncompressed on a sample of your production queries and measure quality on your actual output metrics, not just benchmark proxies.
- Compress selectively — compress RAG context and conversation history aggressively; compress system prompts conservatively; don't compress the user's actual query.
- Monitor quality per task type — different task types have different sensitivity to compression. Track quality metrics (downstream task accuracy, user ratings, etc.) broken down by task type, not just in aggregate.
- Combine with caching — compress the system prompt once, then cache the compressed version. Compress RAG context per query. You get the compression savings on every call and the caching savings on repeated prefixes.
"Prompt compression is the only LLM cost optimization that requires no model changes, no routing infrastructure, and no quality tradeoff calibration — you just make the input smaller. The savings are immediate and the risk is low."
Frequently Asked Questions
What is prompt compression?
Prompt compression removes redundant, low-information tokens from LLM inputs before sending them to the model. By stripping filler words, repeated context, and low-importance tokens, it reduces the token count of inputs by 2–20× while preserving the information the model needs to produce a correct response.
How does LLMLingua work?
LLMLingua uses a small language model (typically Llama or Falcon) to score the perplexity of each token in the prompt. Low-perplexity tokens — those that are predictable and redundant — are dropped. High-perplexity tokens — those that carry essential information — are preserved. The result is a compressed prompt that retains the key information at a fraction of the token count.
Does prompt compression work with all LLMs?
Yes — prompt compression is model-agnostic. It runs before the prompt reaches the target LLM, so it works with GPT-4, Claude, Gemini, or any other model. The compressor is a separate small model used only for token scoring, not the same model handling inference.
What is the difference between prompt compression and prompt caching?
Prompt caching (available in Claude and GPT-4o) stores a processed version of a repeated prefix to avoid re-processing it on each call — it reduces compute cost on repeated context. Prompt compression reduces the token count of the prompt itself before it reaches the model. They are complementary: compress first, then cache the compressed prefix.