Prompt caching lets an LLM provider store the already-processed prefix of your prompt — the system prompt, tool definitions, and few-shot examples that repeat on every request — and bill it at a steep discount when it repeats. As of June 2026, all three major providers discount cached input by 90%: OpenAI on the GPT-5.x family, Anthropic flat across all current Claude models, and Google on Gemini 2.5 and later. The fine print is where the math changes: Anthropic charges a cache-write surcharge (1.25× input for the 5-minute TTL, 2× for 1-hour), Gemini's explicit caching adds a per-hour storage fee, and each provider has a different minimum cacheable length and cache lifetime.

This post is the economics reference: verified model-level prices for all three providers, the mechanics that determine your real hit rate, a worked example (5,000-token system prompt at 10,000 requests/day), and corrections to the stale numbers still circulating from the GPT-4o and Gemini 2.0 era. Prompt caching is the single highest-leverage technique in our LLM caching strategies guide, and a core lever in enterprise LLM cost optimization.

LeanLM (not affiliated with Google's LearnLM educational AI) is an LLM cost optimization platform — this post covers the caching economics we validate against production traffic.

Definition

Prompt caching is a provider-side optimization that stores the computed state (KV cache) of a prompt's leading tokens so that subsequent requests sharing that exact prefix skip recomputation and are billed at a reduced cached-input rate — 90% off list on OpenAI GPT-5.x, Anthropic Claude, and Gemini 2.5+ as of June 2026. It is exact-prefix matching, not similarity matching: that's semantic caching, a separate technique with separate economics.

Prompt caching economics compared across OpenAI, Anthropic Claude, and Google Gemini, June 2026
Same 90% headline discount, three different cost models: free automatic writes (OpenAI), surcharged writes (Anthropic), and storage-billed explicit caches (Gemini).
90%
cached-input discount on all three major providers as of June 2026 — up from 50% (OpenAI, GPT-4o era) and 75% (Gemini 2.0). The write surcharges and TTLs below are what separate the headline from your invoice.

How Prompt Caching Works — and Why Prompt Structure Decides Your Hit Rate

All three providers cache by exact prefix match. The provider hashes the leading tokens of your request; if an identical prefix was processed recently, those tokens are served from cache at the discounted rate. The match stops at the first differing token — everything after it is billed at full price.

That single rule dictates how you should structure prompts:

Caching and prompt compression compose rather than compete: cache the stable prefix at 90% off, and compress what you can't cache — the volatile per-request tail that always bills at full input rate.

Provider Mechanics Compared (Last Verified June 2026)

Mechanic OpenAI Anthropic (Claude) Google (Gemini)
Cached-read discount Up to 90% — model-dependent: 50% gpt-4o, 75% gpt-4.1, 90% GPT-5.x, 98.75% gpt-realtime audio 90% flat (reads at 0.1× input) across current models 90% on 2.5+ (cached tokens = 10% of input); was 75% on Gemini 2.0
Cache-write surcharge None 1.25× input (5-min TTL); 2× input (1-hr TTL) None for implicit; explicit caching bills cached tokens + storage ($1.00/MTok/hr Flash, $4.50/MTok/hr Pro)
Minimum cacheable prompt 1,024 tokens Fable 5: 512 (1,024 on Bedrock); Opus 4.8 / Sonnet 4.6: 1,024; Haiku 4.5: 4,096 3.5 Flash & 3.1 Pro: 4,096; 2.5 Flash & 2.5 Pro: 2,048
TTL / retention In-memory ~5–10 min of inactivity (up to 1 hr); extended retention up to 24h on gpt-5.5, 5.5-pro, 5.4, 5.2 — default 24h on gpt-5.5+, no extra fee 5 min default; 1 hr at 2× write price; reads refresh the TTL Implicit: not user-controlled. Explicit: default 1 hour, configurable
Automatic vs explicit Fully automatic, no code changes; prompt_cache_key for routing optimization Both (automatic mode new in 2026): single top-level cache_control, or explicit per-block breakpoints (max 4) Implicit automatic by default on 2.5+; explicit cached_content API for guaranteed hits

Sources: OpenAI prompt caching guide · Anthropic prompt caching docs · Gemini context caching docs. Last verified June 2026.

Model-Level Cache Pricing (per 1M Tokens, June 2026)

OpenAI — GPT-5.x family

Model Input Cached input Output Cache discount
gpt-5.5 (flagship)$5.00$0.50$30.0090%
gpt-5.5-pro$30.00not offered$180.00n/a
gpt-5.4 (workhorse)$2.50$0.25$15.0090%
gpt-5.4-mini$0.75$0.075$4.5090%
gpt-5.4-nano (budget)$0.20$0.02$1.2590%

Source: OpenAI pricing. Note gpt-5.5-pro does not offer cached input at all. There is no write surcharge on OpenAI.

Anthropic — Claude 4.x/5 family

Model Input 5m cache write 1h cache write Cache read Output
Claude Fable 5 (flagship)$10.00$12.50$20.00$1.00$50.00
Claude Opus 4.8$5.00$6.25$10.00$0.50$25.00
Claude Sonnet 4.6 (workhorse)$3.00$3.75$6.00$0.30$15.00
Claude Haiku 4.5 (budget)$1.00$1.25$2.00$0.10$5.00

Source: Anthropic pricing. Opus 4.8 is $5/$25 — the older $15/$75 applies only to deprecated Opus 4.1/4. Anthropic notes Opus 4.7+ uses a new tokenizer that may use up to 35% more tokens for the same fixed text — factor that into cross-model comparisons.

Google — Gemini 3.x / 2.5 family

Model Input Output Cached tokens Cache storage /1M tok/hr
gemini-3.1-pro-preview (flagship)$2.00 (≤200k) / $4.00 (>200k)$12.00 / $18.00$0.20 / $0.40$4.50
gemini-3.5-flash (workhorse)$1.50$9.00$0.15$1.00
gemini-3-flash-preview$0.50$3.00$0.05$1.00
gemini-3.1-flash-lite (budget)$0.25$1.50$0.025$1.00
gemini-2.5-pro$1.25 (≤200k)$10.00$0.125$4.50
gemini-2.5-flash$0.30$2.50$0.03$1.00
gemini-2.5-flash-lite$0.10$0.40$0.01$1.00

Source: Gemini API pricing. Every cached-token rate is exactly 10% of the input rate — a uniform 90% discount across the lineup. Note the Flash price jump: gemini-3.5-flash at $1.50/$9.00 is 5×/3.6× the price of gemini-2.5-flash ($0.30/$2.50), which makes caching far more consequential on the new Flash tier.

To see how cached rates reshape the cross-provider price ladder once discounts stack, see our effective cost table.

Early Access

How Much Is Your Cache Miss Rate Costing You?

LeanLM profiles your production LLM traffic, measures your actual cache hit rate per route, and quantifies what prefix restructuring would save — before you change a line of prompt code. Get early access.

You're on the list. We'll be in touch.

No spam, ever.

Worked Example: 5,000-Token System Prompt at 10,000 Requests/Day

Take a common production shape: a 5,000-token stable prefix (system prompt + tool definitions), 10,000 requests/day, traffic spread across the day (~7 requests/min — enough to keep every provider's cache warm: Anthropic reads refresh the 5-minute TTL, and it sits within OpenAI's ~15 requests/min per prefix+key guidance). Workhorse tier on each provider; figures cover the cached prefix only — per-request user input and output bill normally on top.

OpenAI gpt-5.4 ($2.50 input / $0.25 cached, no write surcharge):

Claude Sonnet 4.6 ($3.00 input / $3.75 5-min write / $0.30 read):

Gemini 3.5 Flash ($1.50 input / $0.15 cached, implicit caching — 5,000 tokens clears the 4,096 minimum):

Provider / model Prefix cost, uncached Prefix cost, cached Saved per month Effective discount
OpenAI gpt-5.4$125.00/day$12.51/day~$3,37590.0%
Claude Sonnet 4.6$150.00/day$15.02/day~$4,04990.0%
Gemini 3.5 Flash$75.00/day$7.51/day~$2,02590.0%

Computed from the verified list prices above; 30-day months; assumes the cache stays warm all day. Lower request volume or bursty traffic lowers the hit rate and the realized discount.

One more lever stacks on top: all three providers offer 50% batch API discounts, and caching now combines with them — Anthropic's pricing page states the multipliers stack (Sonnet 4.6 batch + cache read = $3 × 0.5 × 0.1 = $0.15/MTok effective input, 95% off list); OpenAI supports cached input inside Batch for GPT-5+ models and recommends Flex processing as the combine path; Gemini publishes discounted batch context-caching rates for explicit caching.

Common Misconceptions (2026 Update)

Most prompt-caching content online predates the 2026 pricing changes. Each correction below is sourced to the provider's current documentation.

50% → 90%

"OpenAI caching only saves 50%"

Outdated — that figure is from the GPT-4o era. Per OpenAI's current pricing, cached input is 75% off on gpt-4.1 and 90% off across the entire GPT-5.x family (gpt-realtime audio reaches 98.75%). If your cost model still assumes 50%, you're overestimating OpenAI input costs on cached traffic by 5×.

75% → 90%

"Gemini's cache discount is 75%"

That was Gemini 2.0. On Gemini 2.5 and later, every published cached-token rate is exactly 10% of the input rate — a 90% discount across the lineup, from 3.1 Pro down to 2.5 Flash-Lite. Many older comparison posts still cite 75%.

24h

"Caches die after 5 minutes, so low-volume traffic can't benefit"

No longer true on OpenAI: extended cache retention up to 24 hours is available on gpt-5.5, gpt-5.5-pro, gpt-5.4, and gpt-5.2 — and it's the default on gpt-5.5+ at no extra fee. Anthropic still defaults to 5 minutes, but reads refresh the TTL and a 1-hour tier is available at 2× write price.

auto

"Claude caching requires manual cache_control breakpoints"

Anthropic added an automatic caching mode in 2026: a single top-level cache_control and the API places breakpoints for you. Explicit per-block breakpoints (max 4) remain available for fine-grained control. Also new: caches are isolated per workspace as of February 5, 2026 (previously per organization).

95% off

"Caching and Batch discounts don't stack"

Anthropic's pricing page now states caching multipliers stack with the Batch API discount — 95% off effective input on a batch cache read (one caveat: cache pre-warming with max_tokens: 0 is rejected in batches). OpenAI's flat "no" is gone too: GPT-5+ models get cached-input pricing inside Batch, and the cookbook measured Flex processing at +8.5pp cache-hit rate vs equivalent Batch jobs. Gemini supports explicit caching in batch with published discounted rates; implicit caching inside batch jobs is undocumented either way.

Worth a footnote for completeness: DeepSeek's automatic context caching is the most aggressive in the market — deepseek-v4-flash bills $0.14/MTok on a cache miss and $0.0028/MTok on a hit (98% off) as of the April 2026 price cut (DeepSeek pricing).

Production Case Studies

Notion AI
Production AI features on Claude — large stable instruction prefix per feature
Anthropic's published customer result: prompt caching reduced Notion's costs by 90% and latency by up to 85%. The shape fits the technique perfectly — long, stable per-feature instructions reused across millions of requests. anthropic.com ↗ · claude.com/customers ↗
90%
cost reduction (85% latency)
OpenAI Cookbook: coding customer
High-volume coding workload — cache routing optimization
From OpenAI's "Prompt Caching 201" cookbook: a coding customer raised cache hit rate from 60% to 87% by adopting prompt_cache_key, so requests sharing a prefix consistently land on the same cache shard. OpenAI also reports caching cuts time-to-first-token by up to ~80%, and advises staying within ~15 requests/min per prefix+key combination. cookbook ↗
60→87%
cache hit rate
Claude Code (Anthropic first-party)
Agentic coding tool — architecture designed around one hot cache prefix
Anthropic's own docs describe Claude Code as architected around a single hot cache prefix: stable system prompt first, deferred tool loading, and append-only skills so the prefix never breaks mid-session — with cache reads billed at roughly 10% of the standard input rate. The lesson generalizes: design the prompt assembly order around the cache, not the other way around. code.claude.com ↗ · claude.com/blog ↗
~10%
of standard input rate on reads

Early Access

Validate Caching Gains on Your Real Traffic

Headline discounts assume perfect hit rates. LeanLM replays your production prompts against restructured, cache-optimized variants and reports the actual savings — and any quality deltas — before you ship. Join the waitlist.

You're on the list. We'll be in touch.

No spam, ever.

Implementation Checklist

  1. Restructure for prefix stability. Stable content (system prompt, tools, examples) first in a fixed order; volatile content (user message, RAG chunks) last. Audit for timestamps, request IDs, and non-deterministic serialization in the prefix — one changing byte invalidates everything after it.
  2. Check the minimum cacheable length. 1,024 tokens (OpenAI); 512 on Claude Fable 5 (1,024 on Bedrock), 1,024 on Opus 4.8/Sonnet 4.6, 4,096 on Haiku 4.5; 4,096 on Gemini 3.5 Flash/3.1 Pro, 2,048 on 2.5 Flash/Pro. Short prompts silently never cache.
  3. Place breakpoints deliberately (Anthropic). Start with automatic mode; move to explicit breakpoints (max 4) when you need separate boundaries for tools, system prompt, and conversation history. Choose 5-minute vs 1-hour TTL based on traffic cadence — remember reads refresh the TTL, so steady traffic keeps the cheap 5-minute cache alive indefinitely.
  4. Route with prompt_cache_key (OpenAI). The 60%→87% hit-rate case above came from this single change. Keep each prefix+key combination under ~15 requests/min; shard the key if you exceed it.
  5. Monitor hit rate as a first-class metric. Track cached vs uncached input tokens from API responses per route. A hit-rate drop after a deploy almost always means a prefix regression. If you're optimizing further, layer semantic caching in front for repeated questions, and prompt compression on the uncacheable tail.

Frequently Asked Questions

How much does prompt caching save in 2026?

As of June 2026, all three major providers discount cached input tokens by 90%: OpenAI on the GPT-5.x family, Anthropic flat across Claude models (cache reads at 0.1× input), and Google on Gemini 2.5 and later. Anthropic charges a cache-write surcharge (1.25× input for 5-minute TTL, 2× for 1-hour) and Gemini explicit caching adds a storage fee, so realized savings depend on hit rate — at high request volumes the effective discount approaches the full 90%.

What is the difference between Anthropic and OpenAI prompt caching?

OpenAI caching is fully automatic with no code changes, no write surcharge, a 1,024-token minimum, and extended retention up to 24 hours (default on gpt-5.5 and later, no extra fee). Anthropic offers both an automatic mode (new in 2026) and explicit cache_control breakpoints (max 4), charges 1.25× input to write a 5-minute cache (2× for 1-hour), and reads refresh the TTL. Both read cached tokens at 90% off; Anthropic's surcharge pays for itself on the first cache hit.

How long do cached prompts last?

OpenAI: in-memory caches persist roughly 5–10 minutes of inactivity (up to 1 hour), with extended retention up to 24 hours on gpt-5.5, gpt-5.5-pro, gpt-5.4, and gpt-5.2 — default 24h on gpt-5.5+ at no extra fee. Anthropic: 5 minutes by default, 1 hour at 2× write price, and every cache read refreshes the TTL. Gemini: implicit caching lifetime is not user-controlled; explicit caching defaults to 1 hour and is configurable.

Does prompt caching stack with Batch API discounts?

Increasingly, yes. Anthropic states explicitly that caching multipliers stack with the Batch API's 50% discount — Claude Sonnet 4.6 batch + cache read works out to $0.15 per million input tokens, 95% off list. OpenAI supports cached-input pricing inside Batch for GPT-5+ models (pre-GPT-5 models are not supported), and recommends Flex processing for the full 50% discount plus caching. Gemini publishes discounted batch context-caching rates for explicit caching; implicit caching inside batch jobs is undocumented.