All three major LLM providers — OpenAI, Anthropic, and Google — give a flat 50% discount on both input and output tokens for requests submitted through their batch APIs, in exchange for asynchronous processing inside a 24-hour window. In practice the window rarely matters: Anthropic's documentation says most batches complete within 1 hour, and Google says the 24-hour target is beaten "in the majority of cases." No tiering, no negotiation, no commitment — this is the most reliable discount in LLM pricing. And it stacks with prompt caching in provider-specific ways that almost nobody documents clearly, which is what the second half of this post settles.
LeanLM (not affiliated with Google's LearnLM educational AI) is an LLM cost optimization platform — this post is part of our series on LLM caching strategies and discount stacking.
Definition
An LLM batch API is an asynchronous submission endpoint where you upload many requests at once, the provider processes them on spare capacity within a stated window (24 hours at OpenAI, Anthropic, and Google), and you collect the results when the job completes. Because the provider controls scheduling, all three price batched tokens at exactly half the synchronous rate — input and output alike.
How Batch APIs Work — and the Limits, Side by Side
The mechanics are nearly identical across providers: you submit a file (or inline payload) of independent requests, each tagged with a custom ID; the job processes asynchronously; you poll for status and download results. Batched traffic draws on separate rate-limit pools — OpenAI states batch usage doesn't consume your synchronous rate limits, which makes batch a free capacity expansion as well as a discount.
| Provider | Discount | Turnaround | Batch limits | Stacks with caching? |
|---|---|---|---|---|
| OpenAI (Batch API) | 50% in + out | 24h window, "often more quickly" | 50,000 requests/batch; 200 MB input file; 2,000 batches/hr | Partial — GPT-5+ models only; Flex is the full-caching path |
| Anthropic (Message Batches API) | 50% in + out | Most batches <1 hour; 24h ceiling, expired requests not billed | 100,000 requests or 256 MB per batch; results kept 29 days | Yes — explicitly stated on the pricing page |
| Google (Gemini Batch API) | 50% in + out | 24h target, "in majority of cases, it is much quicker" | 2 GB input file; inline requests <20 MB | Yes for explicit caching (published batch rates); implicit undocumented |
Last verified June 2026 against each provider's live documentation: OpenAI batch guide, Anthropic batch processing docs, Gemini batch API docs.
One operational note on the 24-hour ceiling: it's a hard expiry, not a soft target. Anthropic's docs state that batches expire if processing doesn't complete within 24 hours — you aren't billed for expired requests, but your pipeline needs to detect and resubmit them. Under high demand, providers warn that more requests may expire. Design for the ceiling, enjoy the typical sub-hour reality.
What's Batchable — and What Isn't
The 50% discount is only useful for workloads that tolerate latency. In production these consistently are:
- Evals. Regression suites, model comparisons, prompt A/B scoring — thousands of test cases with no human waiting.
- Embeddings and metadata backfills. Re-tagging, re-summarizing, or re-classifying an existing corpus after a schema or model change.
- Document processing pipelines. Contract extraction, report summarization, OCR cleanup — anything queue-driven.
- Classification and moderation backlogs. Ticket triage, content moderation sweeps, lead scoring runs.
- Nightly agent steps. Plenty of agentic work isn't interactive — overnight research compilation, scheduled report generation, periodic data reconciliation. Moving these offline is one of the cheapest wins in AI agent cost optimization, because agent steps are exactly the high-volume, repeated-prefix traffic that batch + caching rewards most.
- Synthetic data generation. Training-set augmentation and distillation corpora, where volume is huge and deadlines are days, not seconds.
What isn't batchable: anything interactive. Chat, search, copilots, support agents mid-conversation — any flow where a person is waiting cannot absorb even the typical sub-hour turnaround, let alone the 24-hour worst case. For interactive traffic, your levers are prompt caching and model routing instead.
Early Access
How Much of Your Traffic Is Secretly Batchable?
LeanLM profiles your production LLM traffic, identifies which calls tolerate async processing, and quantifies the batch + caching savings before you change a line of code. Get early access.
Does Batch Stack With Prompt Caching? The Per-Provider Rules
This is where the documentation gets thin and the misinformation gets loud. Older posts claim batch and caching never combine; others claim they always do. Both are wrong. Here is each provider's actual rule, as of June 2026.
Anthropic: Yes — Verbatim on the Pricing Page
Anthropic is the only provider that states the rule outright: caching multipliers "stack with other pricing modifiers, including the Batch API discount." The multipliers literally multiply. Worked out for Claude Sonnet 4.6: $3/M list input × 0.5 batch × 0.1 cache read = $0.15/M effective input — 95% off list.
One caveat: cache pre-warming requests (max_tokens: 0) are rejected inside batches — an ephemeral cache entry written mid-batch would likely expire before any follow-up runs. Instead, Anthropic's batch docs recommend using the 1-hour cache TTL on shared prefixes so concurrent batch requests hit the same cache entry.
OpenAI: Partial in Batch — Flex Is the Full-Stack Path
Caching inside the Batch API works only on newer models: per OpenAI, "there isn't model parity for caching on the Batch API — pre-GPT-5 models are not supported." So GPT-5-and-later batches can receive cached-input pricing; older models in batch get the 50% discount and nothing more.
For the full combination, OpenAI's recommended route is Flex processing on the Responses API: the same 50% discount plus full prompt caching and prompt_cache_key routing support. The OpenAI cookbook measured Flex at +8.5 percentage points cache-hit rate versus an equivalent Batch workload — Flex requests flow through the same cache-aware routing as synchronous traffic, while Batch scheduling scatters them.
Gemini: Explicit Caching Yes (Published Rates) — Implicit Undocumented
Google settles the question by publishing the numbers: explicit context caching is supported for batch requests, and the pricing page lists discounted batch caching rates per model — gemini-3.5-flash cached tokens cost $0.075/M in batch versus $0.15/M synchronous. Use the cached_content API for guaranteed hits, and remember explicit caching adds a storage fee ($1.00/MTok/hr on Flash, $4.50 on Pro).
The gap: whether implicit (automatic) caching applies inside batch jobs is undocumented either way. Don't budget around it — if the discount matters to your unit economics, use explicit caching.
Worked Example: 1M Classification Calls per Month
Take a representative offline workload: 1 million classification calls per month, each with 2,000 input tokens (a 1,500-token shared prefix — instructions, taxonomy, few-shot examples — plus 500 unique tokens) and 100 output tokens. That's 2B input tokens (1.5B cacheable) and 100M output tokens monthly. Here's the bill on each provider's workhorse model, synchronous-uncached versus batch-plus-cache, using June 2026 list prices:
| Model | Sync, no cache | Batch + cache | Savings |
|---|---|---|---|
| Claude Sonnet 4.6 ($3/$15) | $7,500 | $1,725 | 77% |
| gpt-5.4 ($2.50/$15) | $6,500 | $1,750 | 73% |
| gemini-3.5-flash ($1.50/$9) | $3,900 | ~$939 | 76% |
Sonnet 4.6: 1.5B cached × $0.15/M + 0.5B unique × $1.50/M + 100M out × $7.50/M (cache writes on the 1-hour TTL are a rounding error at this scale). gpt-5.4: cached input billed at the $0.25/M cached rate inside batch — whether the batch discount also applies on top of the cached rate isn't documented, so this is the conservative figure — plus 0.5B × $1.25/M + 100M × $7.50/M. gemini-3.5-flash: 1.5B × $0.075/M batch-cached + 0.5B × $0.75/M + 100M × $4.50/M + ~$1/month explicit cache storage. List prices from each provider's pricing page, June 2026 — see our effective cost table for every model and multiplier.
Three different stacking rules, one consistent outcome: roughly three-quarters of the bill disappears, with zero quality impact — these are the same models producing the same outputs, just scheduled differently and reusing a prefix. Compare that with the single-digit-percent gains most teams chase through prompt trimming. For enterprises, batch migration is consistently among the highest-leverage items in enterprise LLM cost optimization because it requires no model change and no quality re-validation.
Early Access
Want This Math Run on Your Actual Workload?
LeanLM computes effective per-token costs across providers — batch, caching, and routing multipliers included — on your real traffic mix, then validates the cheapest configuration before anything ships. Join the waitlist.
Decision Checklist Before You Migrate
- Latency tolerance: design for 24 hours, not the typical hour. If a downstream system or SLA assumes results within N hours where N < 24, you need a fallback path for the slow tail.
- Retry semantics: make every request idempotent and resubmittable. Requests that expire at the 24-hour mark aren't billed (Anthropic states this explicitly), but they also aren't done. Tag requests with stable custom IDs and build the resubmission loop before production, not after the first expiry incident.
- Check your provider's caching stack rule. On OpenAI, confirm your model is GPT-5 or later before assuming cached-input pricing in Batch — or use Flex processing for the full caching stack at the same 50% off. On Gemini, use explicit caching; implicit-in-batch is undocumented.
- Don't pre-warm Anthropic caches into a batch.
max_tokens: 0pre-warming requests are rejected inside batches. The supported pattern is the 1-hour cache TTL (2× write cost on the prefix, paid once) so concurrent batch requests share the entry. - Run the comparison against your sync + caching baseline. If your workload already hits 90% cache reads synchronously, batch adds its 50% only on top of what's left — still significant, but verify the delta justifies the pipeline work. The arithmetic above is the template.
"Batch is the rare LLM discount with no quality trade-off to validate — same model, same outputs, half the price. The only engineering question is whether your pipeline can wait, and for most non-interactive workloads the honest answer is yes."
How LeanLM Fits In
LeanLM profiles production LLM traffic and classifies every call by latency tolerance, prefix reuse, and model tier. The output is a concrete migration plan — which calls move to batch, which provider's stacking rules apply, what the effective per-token rate becomes — validated against your real traffic before anything changes. Batch + caching is typically the first plan we generate because it's the only optimization that needs no quality re-testing.
Frequently Asked Questions
How much does the OpenAI Batch API save?
A flat 50% on both input and output tokens versus synchronous pricing, with results returned within a 24-hour window — often more quickly, per OpenAI's documentation. Limits as of June 2026: 50,000 requests per batch, a 200 MB input file, and 2,000 batches enqueued per hour, drawing on a separate rate-limit pool from your synchronous traffic.
Does the batch discount stack with prompt caching?
It depends on the provider. Anthropic: yes, explicitly — the pricing page states caching multipliers stack with the Batch API discount, so Sonnet 4.6 cached input in a batch costs $3 × 0.5 × 0.1 = $0.15/M (95% off list). OpenAI: partially — caching inside the Batch API works only on GPT-5-and-later models; Flex processing offers the same 50% discount with full caching and prompt_cache_key support. Gemini: explicit context caching is supported in batch with published discounted rates (gemini-3.5-flash cached tokens are $0.075/M in batch vs $0.15 sync); implicit caching inside batch jobs is undocumented.
How long do LLM batch jobs actually take?
All three providers quote a 24-hour processing window, and all three say most batches finish far sooner. Anthropic's documentation states most batches complete within 1 hour, with results accessible when all messages finish or after 24 hours, whichever comes first. Google says the 24-hour target is beaten in the majority of cases. OpenAI says batches often complete more quickly than the 24-hour window. Design for the 24-hour ceiling; enjoy the typical sub-hour turnaround.
What workloads should I move to a batch API?
Anything that tolerates up to 24 hours of latency: large-scale evals, embeddings and metadata backfills, document processing pipelines, classification and moderation backlogs, nightly or offline AI agent steps, and synthetic data generation. What doesn't belong in a batch: anything a user is waiting on — chat, search, copilots, or any interactive flow.