Prompt caching lets an LLM provider store the already-processed prefix of your prompt — the system prompt, tool definitions, and few-shot examples that repeat on every request — and bill it at a steep discount when it repeats. As of June 2026, all three major providers discount cached input by 90%: OpenAI on the GPT-5.x family, Anthropic flat across all current Claude models, and Google on Gemini 2.5 and later. The fine print is where the math changes: Anthropic charges a cache-write surcharge (1.25× input for the 5-minute TTL, 2× for 1-hour), Gemini's explicit caching adds a per-hour storage fee, and each provider has a different minimum cacheable length and cache lifetime.
This post is the economics reference: verified model-level prices for all three providers, the mechanics that determine your real hit rate, a worked example (5,000-token system prompt at 10,000 requests/day), and corrections to the stale numbers still circulating from the GPT-4o and Gemini 2.0 era. Prompt caching is the single highest-leverage technique in our LLM caching strategies guide, and a core lever in enterprise LLM cost optimization.
LeanLM (not affiliated with Google's LearnLM educational AI) is an LLM cost optimization platform — this post covers the caching economics we validate against production traffic.
Definition
Prompt caching is a provider-side optimization that stores the computed state (KV cache) of a prompt's leading tokens so that subsequent requests sharing that exact prefix skip recomputation and are billed at a reduced cached-input rate — 90% off list on OpenAI GPT-5.x, Anthropic Claude, and Gemini 2.5+ as of June 2026. It is exact-prefix matching, not similarity matching: that's semantic caching, a separate technique with separate economics.
How Prompt Caching Works — and Why Prompt Structure Decides Your Hit Rate
All three providers cache by exact prefix match. The provider hashes the leading tokens of your request; if an identical prefix was processed recently, those tokens are served from cache at the discounted rate. The match stops at the first differing token — everything after it is billed at full price.
That single rule dictates how you should structure prompts:
- Stable content first. System prompt, tool/function definitions, few-shot examples, policy text — anything identical across requests belongs at the top, in a byte-stable order.
- Volatile content last. The user message, retrieved RAG chunks, session state, and per-request variables go at the end, after the cacheable prefix.
- No dynamic bytes in the prefix. A timestamp, request ID, or randomly ordered tool list anywhere in the prefix invalidates the cache for every token after it. This is the most common reason teams see 90% pricing and a 20% hit rate.
- Meet the minimum. Prompts shorter than the provider's minimum cacheable length (1,024 tokens on OpenAI; 512–4,096 on Claude; 2,048–4,096 on Gemini, model-dependent) are never cached at all.
Caching and prompt compression compose rather than compete: cache the stable prefix at 90% off, and compress what you can't cache — the volatile per-request tail that always bills at full input rate.
Provider Mechanics Compared (Last Verified June 2026)
| Mechanic | OpenAI | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| Cached-read discount | Up to 90% — model-dependent: 50% gpt-4o, 75% gpt-4.1, 90% GPT-5.x, 98.75% gpt-realtime audio | 90% flat (reads at 0.1× input) across current models | 90% on 2.5+ (cached tokens = 10% of input); was 75% on Gemini 2.0 |
| Cache-write surcharge | None | 1.25× input (5-min TTL); 2× input (1-hr TTL) | None for implicit; explicit caching bills cached tokens + storage ($1.00/MTok/hr Flash, $4.50/MTok/hr Pro) |
| Minimum cacheable prompt | 1,024 tokens | Fable 5: 512 (1,024 on Bedrock); Opus 4.8 / Sonnet 4.6: 1,024; Haiku 4.5: 4,096 | 3.5 Flash & 3.1 Pro: 4,096; 2.5 Flash & 2.5 Pro: 2,048 |
| TTL / retention | In-memory ~5–10 min of inactivity (up to 1 hr); extended retention up to 24h on gpt-5.5, 5.5-pro, 5.4, 5.2 — default 24h on gpt-5.5+, no extra fee | 5 min default; 1 hr at 2× write price; reads refresh the TTL | Implicit: not user-controlled. Explicit: default 1 hour, configurable |
| Automatic vs explicit | Fully automatic, no code changes; prompt_cache_key for routing optimization |
Both (automatic mode new in 2026): single top-level cache_control, or explicit per-block breakpoints (max 4) |
Implicit automatic by default on 2.5+; explicit cached_content API for guaranteed hits |
Sources: OpenAI prompt caching guide · Anthropic prompt caching docs · Gemini context caching docs. Last verified June 2026.
Model-Level Cache Pricing (per 1M Tokens, June 2026)
OpenAI — GPT-5.x family
| Model | Input | Cached input | Output | Cache discount |
|---|---|---|---|---|
| gpt-5.5 (flagship) | $5.00 | $0.50 | $30.00 | 90% |
| gpt-5.5-pro | $30.00 | not offered | $180.00 | n/a |
| gpt-5.4 (workhorse) | $2.50 | $0.25 | $15.00 | 90% |
| gpt-5.4-mini | $0.75 | $0.075 | $4.50 | 90% |
| gpt-5.4-nano (budget) | $0.20 | $0.02 | $1.25 | 90% |
Source: OpenAI pricing. Note gpt-5.5-pro does not offer cached input at all. There is no write surcharge on OpenAI.
Anthropic — Claude 4.x/5 family
| Model | Input | 5m cache write | 1h cache write | Cache read | Output |
|---|---|---|---|---|---|
| Claude Fable 5 (flagship) | $10.00 | $12.50 | $20.00 | $1.00 | $50.00 |
| Claude Opus 4.8 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Sonnet 4.6 (workhorse) | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Haiku 4.5 (budget) | $1.00 | $1.25 | $2.00 | $0.10 | $5.00 |
Source: Anthropic pricing. Opus 4.8 is $5/$25 — the older $15/$75 applies only to deprecated Opus 4.1/4. Anthropic notes Opus 4.7+ uses a new tokenizer that may use up to 35% more tokens for the same fixed text — factor that into cross-model comparisons.
Google — Gemini 3.x / 2.5 family
| Model | Input | Output | Cached tokens | Cache storage /1M tok/hr |
|---|---|---|---|---|
| gemini-3.1-pro-preview (flagship) | $2.00 (≤200k) / $4.00 (>200k) | $12.00 / $18.00 | $0.20 / $0.40 | $4.50 |
| gemini-3.5-flash (workhorse) | $1.50 | $9.00 | $0.15 | $1.00 |
| gemini-3-flash-preview | $0.50 | $3.00 | $0.05 | $1.00 |
| gemini-3.1-flash-lite (budget) | $0.25 | $1.50 | $0.025 | $1.00 |
| gemini-2.5-pro | $1.25 (≤200k) | $10.00 | $0.125 | $4.50 |
| gemini-2.5-flash | $0.30 | $2.50 | $0.03 | $1.00 |
| gemini-2.5-flash-lite | $0.10 | $0.40 | $0.01 | $1.00 |
Source: Gemini API pricing. Every cached-token rate is exactly 10% of the input rate — a uniform 90% discount across the lineup. Note the Flash price jump: gemini-3.5-flash at $1.50/$9.00 is 5×/3.6× the price of gemini-2.5-flash ($0.30/$2.50), which makes caching far more consequential on the new Flash tier.
To see how cached rates reshape the cross-provider price ladder once discounts stack, see our effective cost table.
Early Access
How Much Is Your Cache Miss Rate Costing You?
LeanLM profiles your production LLM traffic, measures your actual cache hit rate per route, and quantifies what prefix restructuring would save — before you change a line of prompt code. Get early access.
Worked Example: 5,000-Token System Prompt at 10,000 Requests/Day
Take a common production shape: a 5,000-token stable prefix (system prompt + tool definitions), 10,000 requests/day, traffic spread across the day (~7 requests/min — enough to keep every provider's cache warm: Anthropic reads refresh the 5-minute TTL, and it sits within OpenAI's ~15 requests/min per prefix+key guidance). Workhorse tier on each provider; figures cover the cached prefix only — per-request user input and output bill normally on top.
OpenAI gpt-5.4 ($2.50 input / $0.25 cached, no write surcharge):
- Uncached: 50M prefix tokens/day × $2.50/MTok = $125.00/day
- Cached: 1 miss (5,000 × $2.50/MTok = $0.013) + 9,999 hits (49.995M × $0.25/MTok = $12.50) = $12.51/day
Claude Sonnet 4.6 ($3.00 input / $3.75 5-min write / $0.30 read):
- Uncached: 50M × $3.00/MTok = $150.00/day
- Cached: 1 write (5,000 × $3.75/MTok = $0.019) + 9,999 reads (49.995M × $0.30/MTok = $15.00) = $15.02/day
- Write amortization: the 1.25× surcharge costs an extra 0.25× on the write, and each read saves 0.9× — so the surcharge pays for itself on the first cache hit. At 10,000 requests/day it rounds to nothing.
Gemini 3.5 Flash ($1.50 input / $0.15 cached, implicit caching — 5,000 tokens clears the 4,096 minimum):
- Uncached: 50M × $1.50/MTok = $75.00/day
- Cached: 1 miss ($0.008) + 9,999 hits (49.995M × $0.15/MTok = $7.50) = $7.51/day
- If you used explicit caching instead for guaranteed hits, add storage: 0.005 MTok × $1.00/MTok/hr × 24h = $0.12/day — still trivial against the savings.
| Provider / model | Prefix cost, uncached | Prefix cost, cached | Saved per month | Effective discount |
|---|---|---|---|---|
| OpenAI gpt-5.4 | $125.00/day | $12.51/day | ~$3,375 | 90.0% |
| Claude Sonnet 4.6 | $150.00/day | $15.02/day | ~$4,049 | 90.0% |
| Gemini 3.5 Flash | $75.00/day | $7.51/day | ~$2,025 | 90.0% |
Computed from the verified list prices above; 30-day months; assumes the cache stays warm all day. Lower request volume or bursty traffic lowers the hit rate and the realized discount.
One more lever stacks on top: all three providers offer 50% batch API discounts, and caching now combines with them — Anthropic's pricing page states the multipliers stack (Sonnet 4.6 batch + cache read = $3 × 0.5 × 0.1 = $0.15/MTok effective input, 95% off list); OpenAI supports cached input inside Batch for GPT-5+ models and recommends Flex processing as the combine path; Gemini publishes discounted batch context-caching rates for explicit caching.
Common Misconceptions (2026 Update)
Most prompt-caching content online predates the 2026 pricing changes. Each correction below is sourced to the provider's current documentation.
"OpenAI caching only saves 50%"
Outdated — that figure is from the GPT-4o era. Per OpenAI's current pricing, cached input is 75% off on gpt-4.1 and 90% off across the entire GPT-5.x family (gpt-realtime audio reaches 98.75%). If your cost model still assumes 50%, you're overestimating OpenAI input costs on cached traffic by 5×.
"Gemini's cache discount is 75%"
That was Gemini 2.0. On Gemini 2.5 and later, every published cached-token rate is exactly 10% of the input rate — a 90% discount across the lineup, from 3.1 Pro down to 2.5 Flash-Lite. Many older comparison posts still cite 75%.
"Caches die after 5 minutes, so low-volume traffic can't benefit"
No longer true on OpenAI: extended cache retention up to 24 hours is available on gpt-5.5, gpt-5.5-pro, gpt-5.4, and gpt-5.2 — and it's the default on gpt-5.5+ at no extra fee. Anthropic still defaults to 5 minutes, but reads refresh the TTL and a 1-hour tier is available at 2× write price.
"Claude caching requires manual cache_control breakpoints"
Anthropic added an automatic caching mode in 2026: a single top-level cache_control and the API places breakpoints for you. Explicit per-block breakpoints (max 4) remain available for fine-grained control. Also new: caches are isolated per workspace as of February 5, 2026 (previously per organization).
"Caching and Batch discounts don't stack"
Anthropic's pricing page now states caching multipliers stack with the Batch API discount — 95% off effective input on a batch cache read (one caveat: cache pre-warming with max_tokens: 0 is rejected in batches). OpenAI's flat "no" is gone too: GPT-5+ models get cached-input pricing inside Batch, and the cookbook measured Flex processing at +8.5pp cache-hit rate vs equivalent Batch jobs. Gemini supports explicit caching in batch with published discounted rates; implicit caching inside batch jobs is undocumented either way.
Worth a footnote for completeness: DeepSeek's automatic context caching is the most aggressive in the market — deepseek-v4-flash bills $0.14/MTok on a cache miss and $0.0028/MTok on a hit (98% off) as of the April 2026 price cut (DeepSeek pricing).
Production Case Studies
prompt_cache_key, so requests sharing a prefix consistently land on the same cache shard. OpenAI also reports caching cuts time-to-first-token by up to ~80%, and advises staying within ~15 requests/min per prefix+key combination. cookbook ↗Early Access
Validate Caching Gains on Your Real Traffic
Headline discounts assume perfect hit rates. LeanLM replays your production prompts against restructured, cache-optimized variants and reports the actual savings — and any quality deltas — before you ship. Join the waitlist.
Implementation Checklist
- Restructure for prefix stability. Stable content (system prompt, tools, examples) first in a fixed order; volatile content (user message, RAG chunks) last. Audit for timestamps, request IDs, and non-deterministic serialization in the prefix — one changing byte invalidates everything after it.
- Check the minimum cacheable length. 1,024 tokens (OpenAI); 512 on Claude Fable 5 (1,024 on Bedrock), 1,024 on Opus 4.8/Sonnet 4.6, 4,096 on Haiku 4.5; 4,096 on Gemini 3.5 Flash/3.1 Pro, 2,048 on 2.5 Flash/Pro. Short prompts silently never cache.
- Place breakpoints deliberately (Anthropic). Start with automatic mode; move to explicit breakpoints (max 4) when you need separate boundaries for tools, system prompt, and conversation history. Choose 5-minute vs 1-hour TTL based on traffic cadence — remember reads refresh the TTL, so steady traffic keeps the cheap 5-minute cache alive indefinitely.
- Route with
prompt_cache_key(OpenAI). The 60%→87% hit-rate case above came from this single change. Keep each prefix+key combination under ~15 requests/min; shard the key if you exceed it. - Monitor hit rate as a first-class metric. Track cached vs uncached input tokens from API responses per route. A hit-rate drop after a deploy almost always means a prefix regression. If you're optimizing further, layer semantic caching in front for repeated questions, and prompt compression on the uncacheable tail.
Frequently Asked Questions
How much does prompt caching save in 2026?
As of June 2026, all three major providers discount cached input tokens by 90%: OpenAI on the GPT-5.x family, Anthropic flat across Claude models (cache reads at 0.1× input), and Google on Gemini 2.5 and later. Anthropic charges a cache-write surcharge (1.25× input for 5-minute TTL, 2× for 1-hour) and Gemini explicit caching adds a storage fee, so realized savings depend on hit rate — at high request volumes the effective discount approaches the full 90%.
What is the difference between Anthropic and OpenAI prompt caching?
OpenAI caching is fully automatic with no code changes, no write surcharge, a 1,024-token minimum, and extended retention up to 24 hours (default on gpt-5.5 and later, no extra fee). Anthropic offers both an automatic mode (new in 2026) and explicit cache_control breakpoints (max 4), charges 1.25× input to write a 5-minute cache (2× for 1-hour), and reads refresh the TTL. Both read cached tokens at 90% off; Anthropic's surcharge pays for itself on the first cache hit.
How long do cached prompts last?
OpenAI: in-memory caches persist roughly 5–10 minutes of inactivity (up to 1 hour), with extended retention up to 24 hours on gpt-5.5, gpt-5.5-pro, gpt-5.4, and gpt-5.2 — default 24h on gpt-5.5+ at no extra fee. Anthropic: 5 minutes by default, 1 hour at 2× write price, and every cache read refreshes the TTL. Gemini: implicit caching lifetime is not user-controlled; explicit caching defaults to 1 hour and is configurable.
Does prompt caching stack with Batch API discounts?
Increasingly, yes. Anthropic states explicitly that caching multipliers stack with the Batch API's 50% discount — Claude Sonnet 4.6 batch + cache read works out to $0.15 per million input tokens, 95% off list. OpenAI supports cached-input pricing inside Batch for GPT-5+ models (pre-GPT-5 models are not supported), and recommends Flex processing for the full 50% discount plus caching. Gemini publishes discounted batch context-caching rates for explicit caching; implicit caching inside batch jobs is undocumented.