LLM Effective Cost Table 2026: Cache + Batch Real Prices

List prices are fiction for any optimized LLM workload. All three major providers now discount cached input by 90%, and all three discount batch processing by 50% on input and output. Stack them where stacking is supported and your effective price runs 5–50% of list. Claude Sonnet 4.6's $3.00 input becomes $0.15. gpt-5.4-nano's $0.20 becomes $0.01. This page computes the real numbers — every cell derived from provider list prices using a stated multiplier methodology, so you can check the arithmetic yourself.

Last verified: July 15, 2026

Every list price on this page was checked against the provider's own pricing page on 2026-06-10 and re-confirmed unchanged on 2026-07-15. Computed cells are pure arithmetic on those verified prices. See the changelog at the bottom for the verification history and the monthly re-check commitment.

LeanLM (not affiliated with Google's LearnLM educational AI) is an LLM cost optimization platform — this page is our reference table for what optimized workloads actually pay per million tokens.

Most LLM cost comparisons stop at list price, which makes them useless for anyone running LLM caching or asynchronous pipelines. The gap between list and effective price is now so large that list-price comparisons routinely point teams at the wrong provider. (And if your volume is high enough, the cheaper option may not be an API at all — see when self-hosting an LLM beats the API on cost.) Below: the effective-cost leaderboard, then the methodology, the full per-provider tables, and the value picks.

Effective Cost Leaderboard: Cheapest LLM APIs by Real Cost (July 2026)

The bottom line — as of July 15, 2026

The cheapest effective LLM API cost is DeepSeek v4-flash at $0.0028 per million input tokens on automatic cache hits — a 98% discount off its $0.14 list price. Among models that stack batch + cache discounts, gpt-5.4-nano leads at $0.01/MTok, and gemini-3.1-pro-preview is the cheapest flagship-class option at $0.10/MTok stacked. These are effective prices — what optimized workloads actually pay after caching and batch discounts — not list price, which overstates real cost by 20–50× and is what most "cheapest LLM" comparisons and benchmark leaderboards report.

Ranked by cheapest effective input cost per million tokens (stacked batch + cache where supported; DeepSeek uses its automatic cache-hit rate, which has no batch tier to stack). Every figure is derived from the provider's own list price using the stated multiplier methodology below — re-verified 2026-07-15, nothing recalled from memory or third-party trackers.

Rank	Model	Provider	Effective input $/MTok	How
1	deepseek-v4-flash	DeepSeek	$0.0028	automatic cache-hit
2	gpt-5.4-nano	OpenAI	$0.01	batch + cache stacked
3	gemini-3.1-flash-lite	Google	$0.0125	batch + cache stacked
4	gemini-2.5-flash	Google	$0.015	batch + cache stacked
5	gemini-3-flash-preview	Google	$0.025	batch + cache stacked
6	gpt-5.4-mini	OpenAI	$0.0375	batch + cache stacked
7	Claude Haiku 4.5	Anthropic	$0.05	batch + cache stacked
8	gemini-3.5-flash	Google	$0.075	batch + cache stacked
9	gemini-3.1-pro-preview	Google	$0.10	batch + cache stacked — cheapest flagship-class
10	gpt-5.4	OpenAI	$0.125	batch + cache stacked

"Effective" cost is what you actually pay after cache and batch discounts — the metric list-price and benchmark leaderboards omit. Ranking is by effective input cost; weight by your own input/output ratio and cache-hit rate to get your real number — the LLM cost calculator does this for your token mix. Full per-provider tables (output prices and every discount column) follow below.

Methodology: How Every Cell Is Computed

The formula

Effective input price = list price × batch multiplier (0.5) × cache-read multiplier (0.1), applied where the provider supports stacking. Example: Claude Sonnet 4.6 = $3.00 × 0.5 × 0.1 = $0.15/MTok — 95% off list. Output tokens get the batch multiplier only (caching never discounts output): $15.00 × 0.5 = $7.50/MTok. Reasoning tokens are a special case — billed at the plain list output rate, since they're freshly generated on every call and don't benefit from either discount.

Three inputs drive every computed cell:

Cache-read multiplier: 0.1×. All three majors now price cached input reads at 10% of list — OpenAI on GPT-5.x models (prompt caching guide; older models get less: 75% off on gpt-4.1, 50% on gpt-4o), Anthropic flat across the lineup (prompt caching docs), and Gemini 2.5+ (context caching docs — the discount rose from 75% to 90%, so older posts understate it).
Batch multiplier: 0.5×. OpenAI (Batch API), Anthropic, and Gemini (Batch API) all discount asynchronous batch traffic 50% on both input and output.
Cache-write amortization (Anthropic only). Anthropic charges 1.25× input to write a 5-minute cache entry and 2× for the 1-hour TTL. The cached columns below show the pure read price (0.1×); at a stated assumption of 1 write per 20 reads on the 5-minute tier, the blended multiplier is (1.25 + 20 × 0.1) ÷ 21 ≈ 0.155× list — e.g. Sonnet 4.6 blends to ~$0.46/MTok cached instead of $0.30. OpenAI has no write surcharge; Gemini's implicit caching has none either, while explicit caching bills a storage fee ($1.00/MTok/hr on Flash models, $4.50/MTok/hr on Pro) instead.

Who lets you stack batch + cache? Anthropic: yes, verbatim — its pricing page states caching multipliers "stack with other pricing modifiers, including the Batch API discount." OpenAI: yes for GPT-5+ models in the Batch API (pre-GPT-5 models are not supported for caching in Batch); Flex processing on the Responses API is OpenAI's recommended combine path — same 50% discount with full caching support. Gemini: yes for explicit context caching — discounted batch caching rates are published per model (e.g. gemini-3.5-flash cached input at $0.075 in batch vs $0.15 interactive); whether implicit caching applies inside batch jobs is undocumented.

Effective LLM cost per million input tokens: list versus cached versus batch versus stacked, June 2026 — The same model, four prices: list, cached (×0.1), batch (×0.5), and stacked (×0.05) — effective input cost spans a 20× range before you've changed a single model choice.

95%

off list price for stacked batch + cached input where providers support combining the two discounts — $3.00 list becomes $0.15 effective on Claude Sonnet 4.6.

The Full Effective Cost Tables (July 2026)

All prices in USD per 1 million tokens. Cached input is the provider's published cache-hit read price (always 0.1× list here). Batch input = list × 0.5. Batch + cache input = list × 0.5 × 0.1 = list × 0.05, shown only where the provider supports stacking. Batch output = list output × 0.5.

OpenAI

Source: OpenAI API pricing. Cached input is 90% off on all GPT-5.x models, with no cache-write surcharge and 24-hour cache retention by default on gpt-5.5 and later. Batch + cache stacking applies to GPT-5+ models only; Flex processing is the recommended path for combining the two.

Model	List input	Cached input	Batch input	Batch + cache input	List output	Batch output
gpt-5.5	$5.00	$0.50	$2.50	$0.25	$30.00	$15.00
gpt-5.4	$2.50	$0.25	$1.25	$0.125	$15.00	$7.50
gpt-5.4-mini	$0.75	$0.075	$0.375	$0.0375	$4.50	$2.25
gpt-5.4-nano	$0.20	$0.02	$0.10	$0.01	$1.25	$0.625

gpt-5.5-pro ($30.00 input / $180.00 output) is excluded because it offers no cached input pricing at all — there is no effective-cost story to compute. Pre-GPT-5 models do not get cached pricing inside the Batch API, so their batch + cache column would be their plain batch price.

Anthropic

Source: Anthropic pricing. The only provider that states batch + cache stacking verbatim on its pricing page. Cached input below is the pure read price; writes cost 1.25× list (5-min TTL) or 2× (1-hour) — at 1 write per 20 reads, blend to ~0.155× list (see methodology). Note Opus 4.8 is $5/$25; the older $15/$75 applied only to the deprecated Opus 4.1/4.

Model	List input	Cached input	Batch input	Batch + cache input	List output	Batch output
Claude Fable 5	$10.00	$1.00	$5.00	$0.50	$50.00	$25.00
Claude Opus 4.8	$5.00	$0.50	$2.50	$0.25	$25.00	$12.50
Claude Sonnet 4.6	$3.00	$0.30	$1.50	$0.15	$15.00	$7.50
Claude Haiku 4.5	$1.00	$0.10	$0.50	$0.05	$5.00	$2.50

Cache pre-warming requests (max_tokens: 0) are rejected inside batches, so plan cache writes on the interactive path. Fast mode (the Opus 4.6–4.8 latency premium) stacks with caching but is not available with Batch.

Google Gemini

Source: Gemini API pricing. The batch + cache column reflects explicit context caching inside batch jobs, for which Google publishes discounted rates per model; explicit caching also bills storage at $1.00/MTok/hr (Flash) or $4.50/MTok/hr (Pro), not included in the per-token figures below. Implicit (automatic) caching inside batch jobs is undocumented.

Model	List input	Cached input	Batch input	Batch + cache input	List output	Batch output
gemini-3.1-pro-preview*	$2.00	$0.20	$1.00	$0.10	$12.00	$6.00
gemini-3.5-flash	$1.50	$0.15	$0.75	$0.075	$9.00	$4.50
gemini-3-flash-preview	$0.50	$0.05	$0.25	$0.025	$3.00	$1.50
gemini-3.1-flash-lite	$0.25	$0.025	$0.125	$0.0125	$1.50	$0.75
gemini-2.5-flash	$0.30	$0.03	$0.15	$0.015	$2.50	$1.25

*gemini-3.1-pro-preview prices are the ≤200k-token-prompt tier; prompts over 200k tokens bill $4.00 input / $18.00 output ($0.40 cached) — double every computed cell for the long-context tier. Google's published batch rates ($1.00/$6.00 for 3.1 Pro, $0.75/$4.50 for 3.5 Flash, and so on) match the ×0.5 derivation exactly.

DeepSeek (caveat rows)

Source: DeepSeek pricing. DeepSeek caching is fully automatic — there's no cache_control to manage, and no batch-discount tier to stack, so this table has only three price columns. Cache-hit prices below are as of the April 26, 2026 price cut, which dropped hit pricing to roughly 1/10 of its prior level.

Model	Input (cache miss)	Input (cache hit)	Output
deepseek-v4-flash	$0.14	$0.0028	$0.28
deepseek-v4-pro	—	$0.003625*	—

*The v4-pro cache-hit price appears to be promotional — secondary sources put the regular rate at ~$0.0145 — and DeepSeek's miss/output prices for v4-pro were not independently verified for this table, so those cells are left blank rather than estimated. The v4-flash cache hit at $0.0028 is a 98% discount off its $0.14 miss price — the deepest cache discount of any provider listed.

Early Access

What's Your Effective Cost?

LeanLM profiles your production traffic, measures your real cache hit rate and batchable share, and computes your actual effective cost per model — then validates cheaper configurations against your own workload before anything ships. Join the waitlist below.

Best Value Picks, by Tier

Reading the stacked column (batch + cache input) across providers:

Cheapest effective flagship: gemini-3.1-pro-preview at $0.10/MTok stacked — 2.5× cheaper than gpt-5.5 ($0.25) and 5× cheaper than Claude Fable 5 ($0.50) on the same column. The caveat: that price requires explicit caching (with its storage fee) inside batch, where OpenAI's and Anthropic's paths are operationally simpler.
Cheapest effective workhorse: gemini-3.5-flash at $0.075/MTok stacked, edging out gpt-5.4 ($0.125) and Claude Sonnet 4.6 ($0.15). All three land within 2× of each other — close enough that quality on your task, not price, should decide.
Cheapest effective budget tier: gpt-5.4-nano at $0.01/MTok stacked, with gemini-3.1-flash-lite right behind at $0.0125 and gemini-2.5-flash at $0.015. If your traffic is cache-heavy but can't batch, DeepSeek v4-flash's $0.0028 cache-hit price undercuts everything — automatic caching, no integration work, but also no batch tier to stack.

One trend worth flagging: gemini-3.5-flash lists at 5× the input price of gemini-2.5-flash ($1.50 vs $0.30) and 3.6× on output ($9.00 vs $2.50). The Flash tier is no longer reflexively cheap — which makes caching discipline matter far more on 3.5 Flash than it ever did on 2.5. A 3.5 Flash workload with no caching costs 5× more than the 2.5 Flash workload it replaced; the same workload with a high cache hit rate ($0.15 cached) lands back in the old neighborhood.

To run these numbers against your own token counts and request volume, use the interactive LLM cost calculator.

Cheapest for What? A Decision Framework by Workload

"Which LLM API is cheapest" has no single answer, because the cheapest model changes with the shape of your workload. Reading the tables above through four common cases:

Cheapest for coding

For prototyping or low-volume coding, OpenRouter's free tier includes Qwen3 Coder at $0 within its rate limits. For production coding volume, DeepSeek's cached price becomes the cheapest paid option once your prompts share enough repeated context — system prompts and file contents are exactly the kind of stable prefix that hits a cache repeatedly.

Cheapest at high volume

At production scale list price stops mattering and the stacked column above decides. gpt-5.4-nano ($0.01/MTok) and gemini-3.1-flash-lite ($0.0125) lead on cached, batched traffic. Route only the subtasks that genuinely need more capability to a larger model — nano/lite tiers handle classification, extraction, and formatting at a fraction of the cost.

Cheapest with caching

DeepSeek v4-flash at $0.0028/MTok on cache hits is the cheapest number on this page — a 98% discount off its own list price, and cheaper than any stacked discount available on a major-lab model. It requires genuinely repeated context, and DeepSeek's caching is automatic (no cache_control to manage, and no batch tier to stack on top).

Cheapest raw price, no caching or batching

DeepSeek v4-flash at $0.14 / $0.28 per million tokens — the plain list rate, no integration work required. If you need a major lab specifically (data residency, support SLA, ecosystem), gpt-5.4-nano at $0.20 / $1.25 is the cheapest across OpenAI, Google, and Anthropic.

Is There a Free LLM API?

Yes — three genuinely usable free tiers as of July 2026, each with real (not marketing-only) limits worth knowing before you build against them:

Google Gemini (AI Studio)

Free access to Gemini's Flash-tier models, no credit card required. Limits are tier-based and change more often than Groq's or OpenRouter's published numbers — check your live rate limits in AI Studio rather than trusting a static figure. The most production-adjacent of the three: Google's own infrastructure, not a rate-limited proxy layer.

Groq

Free tier with published, predictable rate limits — Llama 3.3 70B at 30 requests/minute and 1,000 requests/day, no credit card required. Known for very low latency inference, which matters if your prototype is latency-sensitive.

OpenRouter

Roughly 28 free models (DeepSeek R1, Llama 3.3 70B, Qwen3 Coder, Gemma 3, others) behind one unified API. Limits start tighter — 20 requests/minute, 50/day — rising to 1,000/day after a one-time paid credit purchase. The advantage is one integration surface across many free models instead of separate provider accounts.

The honest caveat: all three are built for prototyping and low-volume use, not production traffic. Rate limits will throttle you well before meaningful scale — budget for a paid tier once you're validating with real users, not before.

Honest Caveats

Price is not quality. This table tells you what each model costs, not which model is good enough for your task. The cheapest capable model wins — and "capable" is an empirical question per task type, which is exactly what LLM model routing solves: route each query to the cheapest model that meets your quality bar, and let the effective prices above set the savings ceiling.

Prices change — this page changes with them. OpenAI's cached-input discount went from 50% (gpt-4o era) to 90% on GPT-5.x. Gemini's implicit cache discount rose from 75% to 90%. DeepSeek cut cache-hit prices 10× in April 2026. Any pricing table without a verification date is already suspect; this one carries a changelog and gets re-verified monthly against the provider pages.

Tokenizers differ, so cross-provider $/MTok is not apples-to-apples. Anthropic's docs note the Opus 4.7+ tokenizer "may use up to 35% more tokens for the same fixed text." A $3.00/MTok model that tokenizes your corpus into 30% more tokens is effectively a $3.90/MTok model for your workload. When comparing across providers, multiply by your measured tokens-per-document ratio, not just the rate card.

The stacked column assumes both discounts apply to the same token. Real workloads blend: some traffic is interactive (no batch), some prompts miss the cache, and Anthropic cache writes cost extra. Use the methodology above to blend by your actual hit rate and async share. The mechanics of getting hit rates up — prompt structure, prefix stability, TTL management — are covered in our prompt caching deep-dive, and the operational side of the 50% tier in the LLM batch API guide. For where these two levers fit in the broader stack, see enterprise LLM cost optimization.

Early Access

Stop Paying List Price

Most teams capture less than half of the discounts on this page because cache hit rates and batch routing are nobody's job. LeanLM makes them measurable — and validates every change against your production quality bar. Join the waitlist.

Frequently Asked Questions

What is the cheapest LLM API per million tokens in 2026?

On effective (not list) pricing as of June 2026: the cheapest cached input anywhere is DeepSeek v4-flash at $0.0028/MTok on cache hits. The cheapest batch + cache stacked input from a major provider is gpt-5.4-nano at $0.01/MTok ($0.20 list × 0.5 batch × 0.1 cache). The cheapest stacked flagship-class input is gemini-3.1-pro-preview at $0.10/MTok via explicit caching inside batch.

Is there a free LLM API?

Yes — three genuinely usable free tiers as of July 2026. Google's Gemini via AI Studio offers Flash-tier models with no credit card, on tier-based limits that change often enough that you should check them live rather than trust a static figure. Groq publishes predictable limits (Llama 3.3 70B at 30 requests/minute, 1,000/day) and is notably low-latency. OpenRouter exposes roughly 28 free models behind one API at 20 requests/minute and 50/day, rising to 1,000/day after a one-time paid credit. All three are built for prototyping, not production — rate limits will throttle you well before meaningful scale.

Which LLM API is cheapest for coding?

For prototyping and low-volume coding, OpenRouter's free tier includes Qwen3 Coder at $0 within its rate limits. For production coding volume, DeepSeek v4-flash's cache-hit price of $0.0028/MTok is the cheapest paid option, because coding prompts carry exactly the kind of stable repeated context — system prompts, file contents — that hits a cache reliably. If you need a major lab, gpt-5.4-nano at $0.20/$1.25 list (or $0.01/MTok stacked with batch and cache) is the cheapest across OpenAI, Google, and Anthropic.

Can you combine LLM batch discounts with prompt caching?

It depends on the provider. Anthropic: yes — the pricing page states caching multipliers stack with the Batch API discount. OpenAI: yes for GPT-5+ models in the Batch API (pre-GPT-5 models don't get caching in Batch); Flex processing is OpenAI's recommended way to combine the 50% discount with full caching. Google Gemini: yes for explicit context caching — discounted batch caching rates are published per model; implicit caching inside batch jobs is undocumented.

How do you calculate effective LLM cost per million tokens?

Effective input price = list price × batch multiplier (0.5 if the workload is asynchronous) × cache-read multiplier (0.1 on cache hits, where the provider supports stacking). Example: Claude Sonnet 4.6 at $3.00 list becomes $3.00 × 0.5 × 0.1 = $0.15/MTok — 95% off list. For Anthropic, add cache-write amortization: at 1 write (1.25× input) per 20 reads, the blended cache multiplier is ~0.155× instead of 0.1×. Weight the result by your actual cache hit rate.

How accurate and up to date is this LLM pricing table?

Every list price was verified against the provider's own pricing page on June 10, 2026 and re-confirmed unchanged on July 15, 2026, and every computed cell is derived from those list prices using the stated multiplier methodology — nothing is recalled from memory or copied from third-party aggregators. The page carries a changelog and a monthly re-verification commitment. DeepSeek cache-hit prices reflect the April 26, 2026 price cut and are labeled accordingly.

Changelog

2026-07-28 — Absorbed the standalone Cheapest LLM API comparison into this page (it now 301-redirects here). Added the Cheapest for What? decision framework and the free LLM API tiers section, plus two FAQ entries. The two pages were answering the same question from the same verified figures; consolidating puts one authoritative answer behind one URL.
2026-07-15 — Added the Effective Cost Leaderboard (models ranked by cheapest effective input cost). Re-verified every list price against the provider pricing pages — all unchanged since 2026-06-10.
2026-06-10 — Page published. All list prices verified against OpenAI, Anthropic, Google Gemini, and DeepSeek pricing pages; all effective-cost cells computed from those prices using the methodology above. DeepSeek hit prices reflect the April 26, 2026 price cut.

The LLM Effective Cost Table: What Cache + Batch Pricing Actually Costs in 2026