Every other comparison of LLM cost tracking tools is written by someone who sells one. We don't — LeanLM is an optimization layer that works on top of whichever tracking tool you already have. So this is the comparison you actually want: the trade-offs, the pricing gotchas, and the honest answer to "which one should I use."

The short version: tracking tools fall into three layers, and most teams at scale need one from Layer 1 and one from Layer 2. The layer you're missing is usually the one where your cost surprises are hiding.

LeanLM (not affiliated with Google's LearnLM educational AI) is an LLM cost optimization platform. This post covers the tools that track costs — a separate problem from the tools that reduce them.

The 3-Layer Model

Layer 1: Gateway / Proxy — enforce budget limits before tokens are spent. Tools: LiteLLM, Helicone, Portkey.
Layer 2: Observability / Tracing — trace every call, attribute cost to users and features after the fact. Tools: Langfuse, LangSmith, Braintrust, Datadog.
Layer 3: FinOps / Billing — cross-cloud cost allocation and chargeback reporting. Tools: Vantage, CloudZero, OpenMeter. Needed when LLM spend exceeds ~$50K/month and finance teams get involved.

[Figure: the three layers of LLM cost tracking. Layer 1, Gateway/Proxy; Layer 2, Observability/Tracing; Layer 3, FinOps/Billing.]
Most teams only need Layers 1+2. Layer 3 is a FinOps concern: add it when LLM spend exceeds $50K/month and finance teams need chargeback reporting.

LLM cost tracking happens at three layers: gateway, observability, and FinOps. Provider dashboards (OpenAI's Usage page, Anthropic Console) only show aggregate spend, with no per-user, per-feature, or per-prompt attribution, which is why most teams add a dedicated tool.

TL;DR Comparison Table

| Tool | Layer | Open Source | Free Tier | Paid Starts At | Best For |
|---|---|---|---|---|---|
| Langfuse | Observability | Yes (self-host) | 50K obs/mo | $59/mo | Self-hostable tracing + attribution |
| Datadog | Observability | No | No | ~$31/host/mo + usage | Existing Datadog users |
| Helicone | Gateway | Yes (self-host) | 10K req/mo | $80/mo | Analytics-first gateway |
| LangSmith | Observability | No | Yes (limited) | $39/user/mo | LangChain/LangGraph users |
| Braintrust | Observability + Evals | No | Yes (limited) | Usage-based | Continuous evals + cost |
| LiteLLM | Gateway | Yes (self-host) | Unlimited (self-host) | $0 (hosted: $49/mo) | Multi-model routing + budgets |
| Portkey | Gateway | Yes (self-host) | 10K req/mo | $49/mo | Reliability controls + cost |

Pricing as of June 2026. Self-hosted versions of open-source tools are free but require your own infrastructure. LangSmith Plus is $39/user/month + $0.0036/minute standby on Cloud.

Layer 1: Gateway / Proxy Tools

Gateway tools sit between your application code and the LLM API, so every request passes through them. That position lets them do things no observability tool can: enforce budget limits before a request fires, route to fallback models when a provider is down, and serve repeated prompts from cache so they are never billed. Their cost data is per-request: they record token usage for each individual call rather than the aggregated totals a provider dashboard shows.
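
To make the gateway's place in the request path concrete, here is a minimal sketch of fallback routing. The provider callables, `ProviderError`, and `route_with_fallback` are hypothetical stand-ins for illustration, not any vendor's API.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider callable when the upstream API is unavailable."""


def route_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Hypothetical providers: the primary is down, the fallback answers.
def primary(prompt: str) -> str:
    raise ProviderError("503 from upstream")


def fallback(prompt: str) -> str:
    return f"echo: {prompt}"


used, response = route_with_fallback(
    "hello", [("primary", primary), ("fallback", fallback)]
)
print(used, response)
```

A real gateway would also add timeouts, retries, and per-provider health tracking; the sketch only shows the ordering logic.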

Free

LiteLLM

LiteLLM is the most widely deployed open-source LLM gateway. It normalizes 100+ models — OpenAI, Anthropic, Gemini, Mistral, Bedrock, Azure, and more — to a single OpenAI-compatible API, meaning you change one line of code and gain routing flexibility across the entire LLM landscape.

For cost tracking specifically, LiteLLM offers: per-model cost calculation using a built-in pricing table (updated regularly), per-user and per-team budget enforcement with hard limits and soft alerts, detailed spend analytics at the model and user level, and SQLite/PostgreSQL storage for historical cost data.
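
Per-model cost calculation from a pricing table comes down to a lookup and a multiplication. The sketch below assumes per-million-token pricing; the model names and prices are invented for illustration and are not taken from LiteLLM's table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Illustrative prices only; real tables change often and must be kept current.
PRICES = {
    "example-small": ModelPrice(input_per_million=0.50, output_per_million=1.50),
    "example-large": ModelPrice(input_per_million=5.00, output_per_million=15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request from token counts and the price table."""
    price = PRICES[model]  # a KeyError for unknown models is deliberate: fail loudly
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


cost = request_cost("example-large", input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")
```

Failing loudly on an unknown model is a design choice here: a silent default of zero would hide spend on any model missing from the table.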

Cost tracking highlights: Set max_budget per user key. Track spend via the /spend/users and /spend/models endpoints. Integrates with Prometheus for infra-level cost dashboards. Logs to Langfuse, Helicone, or Datadog if you want trace-level detail alongside the proxy metrics.

Pricing: Open-source and free to self-host. LiteLLM Hosted (managed) starts at $49/month. The Proxy Server is MIT-licensed.

Who should use it: Multi-model teams who need routing + fallbacks + budget enforcement as a unified layer. Excellent choice as the gateway leg of a gateway + observability stack.

Watch out for: Self-hosting adds ops burden. The analytics UI is functional but less polished than dedicated observability tools. For trace-level debugging, pair with Langfuse or Braintrust.

10K req/mo free

Helicone

Helicone is an analytics-first LLM gateway. The integration is famously simple: add one line changing your base URL (or a single header for non-proxy setups), and every LLM call is logged automatically with cost, latency, token counts, and the full prompt/response.

Unlike LiteLLM, which is primarily a routing layer, Helicone leads with analytics. Its dashboard surfaces cost trends by model, time, user, and property (custom metadata you attach to requests). You can filter to any segment — "show me cost for user_id=123 on Claude-3-Opus over the last 30 days" — without any custom instrumentation beyond adding a metadata header.
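
As a rough sketch of what "adding a metadata header" looks like, the helper below builds the attribution headers for one request. The header names (`Helicone-Auth`, `Helicone-User-Id`, `Helicone-Property-<Name>`) reflect Helicone's conventions as we understand them and should be confirmed against the current docs; the helper name, key, and property values are hypothetical.

```python
def helicone_headers(
    api_key: str, user_id: str, properties: dict[str, str]
) -> dict[str, str]:
    """Build per-request headers that tag a call with a user and custom properties."""
    headers = {
        "Helicone-Auth": f"Bearer {api_key}",
        "Helicone-User-Id": user_id,
    }
    # Each custom property becomes its own header, which the dashboard can filter on.
    for name, value in properties.items():
        headers[f"Helicone-Property-{name}"] = value
    return headers


headers = helicone_headers(
    api_key="placeholder-key",  # load the real key from your secret store
    user_id="123",
    properties={"Feature": "summarize", "Environment": "production"},
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

These headers would be passed alongside the provider request, for example through the default-headers option of your HTTP or provider client.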

Cost tracking highlights: Automatic cost calculation across all major providers. Custom properties for per-user, per-feature attribution. Rate limiting and budget alerts. Prompt management with A/B cost comparison. Open-source worker you can self-host on Cloudflare Workers.

Pricing: Free tier: 10,000 requests/month. Pro: $80/month for 1M requests. Enterprise: custom pricing. Self-hosted version is free.

Who should use it: Teams who want detailed LLM analytics with minimal integration work. Strong choice for product teams iterating fast — the prompt playground and cost comparison features make it easy to evaluate prompt changes before deploying.

Watch out for: Routing and fallback features are lighter than LiteLLM. For multi-model routing at scale, LiteLLM is more complete.

10K req/mo free

Portkey

Portkey is a production AI gateway that combines cost tracking with reliability engineering: automatic fallbacks, load balancing across providers, semantic caching, and canary deployments for prompt changes. For teams where availability and cost are coupled concerns, Portkey addresses both in one layer.

Cost tracking in Portkey is built around "virtual keys" — per-user, per-app API key wrappers that carry budget limits, model access controls, and spend tracking. You create a virtual key for each user or team, set a maximum spend, and Portkey hard-blocks requests when the limit is hit. Cost attribution is automatic — every request is tagged to its virtual key.
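
The virtual-key idea can be sketched as a small budget-carrying object checked before each request. This is an illustration of the concept, not Portkey's implementation; `VirtualKey`, `BudgetExceeded`, and the choice to block any request that would push spend past the limit are assumptions.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request would push a key past its spend limit."""


@dataclass
class VirtualKey:
    owner: str
    max_spend_usd: float
    spent_usd: float = 0.0

    def authorize(self, estimated_cost_usd: float) -> None:
        """Hard-block before the request is sent if it would exceed the budget."""
        if self.spent_usd + estimated_cost_usd > self.max_spend_usd:
            raise BudgetExceeded(
                f"{self.owner}: ${self.spent_usd:.2f} spent of ${self.max_spend_usd:.2f}"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Attribute the real cost of a completed request to this key."""
        self.spent_usd += actual_cost_usd


key = VirtualKey(owner="tenant-a", max_spend_usd=1.00)
for _ in range(2):
    key.authorize(0.40)
    key.record(0.40)

try:
    key.authorize(0.40)  # a third request would bring spend to $1.20
    blocked = False
except BudgetExceeded:
    blocked = True
print(blocked, round(key.spent_usd, 2))
```

Because every request is authorized and recorded against a key, attribution falls out of the same mechanism that enforces the limit.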

Cost tracking highlights: Per-user virtual keys with hard budget limits. Cost breakdowns by model, virtual key, and time window. Semantic caching to avoid re-billing identical prompts (cache hit = $0 cost). Real-time spend dashboards. Audit logs for every request.

Pricing: Free tier: 10,000 requests/month. Pro: $49/month for 2M requests. Business: $399/month. Enterprise: custom.

Who should use it: Production teams who need reliability (fallbacks, load balancing) alongside cost tracking. Particularly useful for multi-tenant applications where each customer needs an isolated budget.

Watch out for: Semantic caching requires careful prompt normalization to hit reliably. The feature set overlaps heavily with LiteLLM — evaluate both if you're choosing between gateway tools.
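
To show why normalization matters for hit rates, here is a minimal sketch of a cache key built from a normalized prompt. It is an exact-match key after normalization, which is simpler than embedding-based semantic caching; the `cache_key` function and its normalization rules are assumptions for illustration.

```python
import hashlib
import re


def cache_key(model: str, prompt: str) -> str:
    """Hash the model name plus a whitespace- and case-normalized prompt."""
    normalized = re.sub(r"\s+", " ", prompt).strip().lower()
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


a = cache_key("example-model", "What is  our refund policy?\n")
b = cache_key("example-model", "what is our refund policy?")
c = cache_key("other-model", "what is our refund policy?")
print(a == b, a == c)
```

Lowercasing is lossy and would be wrong for case-sensitive prompts such as code, so the normalization rules need to match your traffic.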

Early Access

Know What You're Spending — and Cut It

LeanLM layers on top of your existing tracking stack to validate optimization changes against your production data. See which tool + optimization combination works on your actual traffic before committing.

Layer 2: Observability / Tracing Tools

Observability tools don't sit in the hot path. They receive trace data after the fact, usually via an SDK or asynchronous export, so they add little or no latency to your LLM calls and can be added at any time without rearchitecting. The trade-off: they can't enforce real-time budget limits the way a gateway can. They tell you what happened and why; the gateway is what stops it from happening again.
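
The "off the hot path" property usually comes from handing traces to a background worker. The sketch below uses only the standard library; the in-memory `exported` list stands in for a real tracing backend, and all names are hypothetical.

```python
import queue
import threading

trace_queue = queue.Queue()  # unbounded, so put() does not block the caller
exported = []                # stand-in for the tracing backend


def export_worker() -> None:
    """Drain the queue in the background; None is the shutdown signal."""
    while True:
        trace = trace_queue.get()
        if trace is None:
            break
        exported.append(trace)  # a real exporter would batch and send over HTTP


def record_trace(model: str, input_tokens: int, output_tokens: int) -> None:
    """Called from the request path; it only enqueues, so it returns immediately."""
    trace_queue.put(
        {"model": model, "input_tokens": input_tokens, "output_tokens": output_tokens}
    )


worker = threading.Thread(target=export_worker, daemon=True)
worker.start()

record_trace("example-model", 1200, 300)
record_trace("example-model", 800, 150)

trace_queue.put(None)  # flush and stop the worker
worker.join()
print(len(exported))
```

The same design explains the limitation: by the time a trace is exported, the request has already been billed.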

Open source

Langfuse

Langfuse is the most widely adopted open-source LLM observability platform. It provides trace-level visibility into every LLM call: input tokens, output tokens, cost (calculated from its model pricing table), latency, user metadata, and the full prompt/completion chain. For multi-step agents and chains, it renders a visual timeline showing how cost and latency decompose across sub-steps.

Cost tracking is first-class in Langfuse. You can view cost by model, by user, by session, or by any custom tag you attach at trace time. The cost dashboard shows daily/weekly/monthly trends with model-level breakdowns. If a new prompt version costs 30% more, you'll see it in the cost trend before it shows up on your API invoice.
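
The prompt-version comparison is a group-by over tagged traces. Here is a minimal sketch with invented trace records; the field names and the `mean_cost_by` helper are hypothetical and do not mirror Langfuse's data model.

```python
from collections import defaultdict

# Invented trace records; in practice these come from your tracing tool's export.
traces = [
    {"prompt_version": "v1", "cost_usd": 0.010},
    {"prompt_version": "v1", "cost_usd": 0.010},
    {"prompt_version": "v2", "cost_usd": 0.012},
    {"prompt_version": "v2", "cost_usd": 0.014},
]


def mean_cost_by(records: list[dict], key: str) -> dict[str, float]:
    """Average cost per trace, grouped by any tag attached at trace time."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[key]] += record["cost_usd"]
        counts[record[key]] += 1
    return {tag: totals[tag] / counts[tag] for tag in totals}


by_version = mean_cost_by(traces, "prompt_version")
change = (by_version["v2"] - by_version["v1"]) / by_version["v1"]
print(f"v2 costs {change:.0%} more per trace than v1")
```

The same function works for any tag, such as a user or session identifier, as long as the tag is attached consistently.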

Cost tracking highlights: Automatic cost estimation for all major models (you can override with custom pricing). Cost aggregation by user, session, or any custom tag. Prompt management with cost comparison across versions. Evaluation metrics co-located with cost data. Python/TypeScript SDK + OpenAI drop-in wrapper + LangChain integration.

Pricing: Self-hosted: free (MIT license, runs on Docker). Cloud: free tier (50,000 observations/month), Hobby ($59/month), Pro ($119/month), Team ($299/month). Enterprise: custom with SSO and SLA.

Who should use it: The default choice for teams that want trace-level cost visibility without vendor lock-in. Self-hosting on your own infrastructure means no per-observation cost as you scale, and you own the data. Strong choice for any team that isn't already standardized on Datadog or LangSmith.

Watch out for: Self-hosting adds ops burden. The built-in evals are lighter than Braintrust. Langfuse doesn't enforce real-time budget limits — pair with LiteLLM or Portkey for that.

$39/user/mo

LangSmith

LangSmith is LangChain's observability and evaluation platform. If you're using LangChain or LangGraph, it's the lowest-friction observability option by far — set two environment variables and every chain, agent, and LLM call is automatically traced with cost and latency data. There's no code change required beyond the env vars.
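
A sketch of the environment-variable setup, done from Python before LangChain is imported. The variable names `LANGSMITH_TRACING` and `LANGSMITH_API_KEY` are the ones we understand current LangSmith docs to use (older setups used `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY`); verify against the docs for your SDK version. The key value is a placeholder.

```python
import os

# Turn tracing on for every chain, agent, and LLM call in this process.
os.environ["LANGSMITH_TRACING"] = "true"

# Placeholder only: keep an existing key if one is already set,
# and load the real value from your secret store in production.
os.environ.setdefault("LANGSMITH_API_KEY", "<your-api-key>")

print(os.environ["LANGSMITH_TRACING"])
```

In most deployments these would be set in the shell or container environment rather than in code.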

The LangSmith trace UI is purpose-built for LangChain's abstractions: it renders chains, retrievals, tool calls, and agent steps in a nested timeline that mirrors how LangChain actually executes them. Cost is summed at each level of the hierarchy, making it easy to identify which part of a chain is expensive.
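
Summing cost at each level of the hierarchy is a recursive roll-up over the run tree. The sketch below uses a hypothetical `Run` class and invented costs; it illustrates the idea rather than LangSmith's internal representation.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    name: str
    own_cost_usd: float = 0.0
    children: list["Run"] = field(default_factory=list)

    def total_cost_usd(self) -> float:
        """Cost of this run plus everything nested beneath it."""
        return self.own_cost_usd + sum(c.total_cost_usd() for c in self.children)


# Invented agent trace: retrieval, a main LLM call, and a tool that calls an LLM.
root = Run(
    "agent",
    children=[
        Run("retrieve", own_cost_usd=0.002),
        Run("answer_llm", own_cost_usd=0.030),
        Run("tool", children=[Run("summarize_llm", own_cost_usd=0.008)]),
    ],
)

priciest = max(root.children, key=lambda run: run.total_cost_usd())
print(f"total ${root.total_cost_usd():.3f}, most expensive step: {priciest.name}")
```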

Cost tracking highlights: Automatic tracing with zero code changes for LangChain users. Cost aggregated at chain, step, and run level. Dataset-based regression testing with cost comparison across prompt versions. Filter and search runs by model, cost, latency, or custom metadata.

Pricing: Developer (free): limited trace storage. Plus: $39/user/month. Enterprise: custom. Hosted Cloud uses $0.0036/minute standby pricing for LangGraph Cloud deployments — this can accumulate if you forget to scale down.
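
To see how the standby rate accumulates, here is the arithmetic for a deployment left running around the clock. The rate comes from the pricing above; the 30-day month and the deployment counts are assumptions.

```python
STANDBY_RATE_PER_MINUTE = 0.0036  # USD, from the pricing above


def standby_cost(days: float, deployments: int = 1) -> float:
    """Cost of leaving deployments on standby for the given number of days."""
    minutes = days * 24 * 60
    return minutes * STANDBY_RATE_PER_MINUTE * deployments


monthly = standby_cost(days=30)
three_idle = standby_cost(days=30, deployments=3)
print(f"one deployment: ${monthly:.2f}/month, three: ${three_idle:.2f}/month")
```

Under these assumptions, one forgotten deployment costs roughly four times the $39 Plus seat price each month.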

Who should use it: Teams using LangChain or LangGraph who want zero-friction observability. If you're not on LangChain, Langfuse or Braintrust will give you more for less money.

Watch out for: LangSmith is optimized for the LangChain abstraction layer — it's less useful if you're making raw OpenAI/Anthropic calls or using another framework. Per-user pricing scales poorly for large engineering teams. The LangGraph Cloud standby pricing is easy to forget.

Usage-based

Braintrust

Braintrust is an AI engineering platform that combines evaluation, logging, and prompt management in one product. Its positioning is slightly different from pure-observability tools: it's built for teams running continuous evals as part of their development cycle, not just monitoring in production.

Cost tracking in Braintrust comes alongside evaluation scores — you can see, for any prompt version, both the quality metrics and the cost. This is the right framing for teams optimizing the cost/quality trade-off: you're not just asking "how much did this cost?" but "how much did this cost, and was it worth it?" The experiment tracking UI makes it easy to compare prompt versions on both dimensions simultaneously.
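
One way to compare versions on both dimensions at once is to keep only the versions that no other version beats on quality and cost together (a Pareto front). The sketch below uses invented scores; the `Experiment` class and `pareto_front` helper are hypothetical and not part of Braintrust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    version: str
    quality: float        # eval score in [0, 1], higher is better
    cost_per_call: float  # USD, lower is better


def pareto_front(experiments: list[Experiment]) -> list[Experiment]:
    """Keep versions that no other version matches or beats on both axes."""
    front = []
    for e in experiments:
        dominated = any(
            o.quality >= e.quality
            and o.cost_per_call <= e.cost_per_call
            and (o.quality > e.quality or o.cost_per_call < e.cost_per_call)
            for o in experiments
        )
        if not dominated:
            front.append(e)
    return front


results = [
    Experiment("v1", quality=0.80, cost_per_call=0.010),
    Experiment("v2", quality=0.86, cost_per_call=0.013),
    Experiment("v3", quality=0.79, cost_per_call=0.015),
]
worth_considering = [e.version for e in pareto_front(results)]
print(worth_considering)
```

Here v3 drops out because v1 is both cheaper and better; choosing between v1 and v2 is the "was it worth it?" judgment that stays with the team.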

Cost tracking highlights: Cost visible at trace and experiment level. Automatic cost calculation for all major providers. Cost comparison across prompt versions in experiment view. LLM playground with cost preview before deployment. Prompt catalog for managing versions across teams.

Pricing: Braintrust uses usage-based pricing (per log/trace event). Free tier available. Contact for enterprise pricing. The model is friendly for getting started but can scale less predictably at very high log volumes.

Who should use it: Teams who care as much about quality as cost and want to track both in the same tool. Strong choice for ML engineers running structured evals. Less necessary if you're primarily doing production monitoring without a structured eval workflow.

Watch out for: The eval focus can feel like overhead for teams that just want cost tracking. At high log volumes, usage-based pricing requires active monitoring. Less LangChain-native than LangSmith.

Enterprise

Datadog LLM Observability

Datadog added LLM Observability (GA 2025) as part of its APM platform. For teams already on Datadog, this is the path of least resistance: LLM cost data flows into the same dashboards, alerts, and on-call runbooks as your existing infrastructure metrics. When an LLM cost spike happens at 2am, your on-call engineer already knows how to use the tool that's paging them.

The integration captures token counts, cost estimates, latency, error rates, and custom evaluation metrics for OpenAI, Anthropic, Cohere, and other providers. Auto-instrumentation covers the most common Python and JS clients — no SDK integration required for the basics.

Cost tracking highlights: Auto-instrumentation for OpenAI, Anthropic, Cohere, and LangChain. Cost and latency SLOs in the same Datadog interface as infra SLOs. Anomaly detection on spend using Datadog's existing ML alerting. Trace correlation — link an LLM cost spike to the specific backend service or deployment that caused it.
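
For intuition about what anomaly detection on spend means, here is a deliberately simple z-score check over daily spend. It is not Datadog's algorithm, which we have not inspected; the function name, threshold, and data are assumptions.

```python
from statistics import mean, stdev


def is_spend_anomaly(
    history: list[float], today: float, threshold: float = 3.0
) -> bool:
    """Flag today's spend if it sits more than `threshold` std devs above the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold


last_week = [100.0, 104.0, 98.0, 101.0, 97.0, 102.0, 99.0]  # invented daily USD spend
spike = is_spend_anomaly(last_week, today=180.0)
normal = is_spend_anomaly(last_week, today=103.0)
print(spike, normal)
```

Production systems typically account for seasonality and trend as well, which a plain z-score ignores.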

Pricing: Datadog is the most expensive option on this list. LLM Observability pricing is usage-based on top of existing Datadog contracts. Typical all-in cost for meaningful LLM monitoring is $500–$2,000+/month at scale. Suitable for enterprises already in the Datadog ecosystem where the incremental cost is low relative to existing spend.

Who should use it: Enterprises already standardized on Datadog where adding a new vendor is a harder sell than an incremental Datadog feature. The observability depth for LLM-specific concerns is shallower than dedicated tools, but the platform integration is unmatched.

Watch out for: Expensive to adopt fresh if you're not already a Datadog customer. LLM-specific features (prompt management, eval workflows) are less mature than Langfuse or Braintrust. Not a good choice for cost-sensitive startups.

Early Access

Already Tracking? Now Cut.

Tracking tools tell you what you spent. LeanLM tells you what you can cut — and validates the cuts on your production traffic before they ship. Join the waitlist.

How to Choose: Decision Framework

Start with the question that costs you money right now:

I don't know which model or feature is driving my bill

→ Start with Helicone (lowest integration friction) or Langfuse (open-source, self-hostable). Both give you per-model, per-user cost attribution within an afternoon of setup. If you're on LangChain, use LangSmith instead — zero code changes.

I need to enforce per-user or per-tenant budget limits

→ Use a gateway tool: LiteLLM for multi-model routing + budget enforcement, Portkey if you also need fallbacks and caching, Helicone if you want analytics-first with lighter routing. Observability tools alone can't enforce hard limits — they only tell you after the fact.

I'm already on Datadog and don't want a new vendor

→ Enable Datadog LLM Observability. The LLM-specific features are shallower than dedicated tools, but the platform integration with your existing dashboards and alerts is worth the trade-off if Datadog is already your source of truth.

I want to compare prompt versions on cost and quality simultaneously

→ Braintrust or LangSmith. Both show cost alongside evaluation metrics in experiment views. Braintrust is the stronger choice for teams with a structured eval workflow; LangSmith for LangChain users.

The Stack Most Teams End Up With

Based on what teams building at serious LLM scale actually use:

  1. One gateway (Layer 1) for real-time budget enforcement and routing: typically LiteLLM or Portkey. Teams already using a cloud provider's managed API tend to skip this layer until per-user billing becomes a problem.
  2. One observability tool (Layer 2) for trace-level cost attribution and debugging: typically Langfuse (self-hosted) for cost-conscious teams, LangSmith for LangChain shops, Datadog for enterprise. Braintrust for teams doing structured evals.
  3. Provider dashboards as a sanity check — OpenAI's Usage dashboard, Anthropic Console — but never as a primary tool (they only show aggregate spend, no per-user or per-feature attribution).

The most common gap we see: teams using only provider dashboards for months, then discovering a single feature or a single user is responsible for 60% of the spend. Any of the Layer 2 tools above would have surfaced that in the first week.
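
Finding that kind of concentration takes only a few lines once traces carry a user or feature tag. The records below are invented and the `top_spender` helper is hypothetical.

```python
from collections import Counter

# Invented trace records tagged with a user at call time.
traces = [
    {"user_id": "u1", "cost_usd": 4.0},
    {"user_id": "u2", "cost_usd": 2.5},
    {"user_id": "u1", "cost_usd": 2.0},
    {"user_id": "u3", "cost_usd": 1.5},
]


def top_spender(records: list[dict], key: str = "user_id") -> tuple[str, float]:
    """Return the biggest contributor and its share of total spend."""
    totals = Counter()
    for record in records:
        totals[record[key]] += record["cost_usd"]
    name, amount = totals.most_common(1)[0]
    return name, amount / sum(totals.values())


name, share = top_spender(traces)
print(f"{name} accounts for {share:.0%} of spend")
```

Passing a feature tag as `key` answers the per-feature version of the same question.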

From Tracking to Optimization

Tracking tools give you the data. The next step is acting on it: once you know where your spend is going, you can start pulling optimization levers.

The tracking data tells you which lever to pull. LeanLM validates the pull — running the optimization on your production traffic and measuring the actual impact before you ship.

Frequently Asked Questions

What is the best tool for tracking LLM costs?

It depends on your stack layer. For real-time budget enforcement before tokens are spent, use a gateway like LiteLLM, Helicone, or Portkey. For per-user, per-feature cost attribution with trace-level detail, use an observability tool like Langfuse, LangSmith, or Braintrust. For teams already on Datadog, its LLM Observability module adds cost tracking without a new vendor. Most teams at scale use one gateway + one observability tool in combination.

How do I track LLM costs per user or per feature?

Use an observability tool with metadata tagging. Langfuse, LangSmith, and Braintrust all support user_id and session_id metadata on traces, which lets you roll up cost by user, feature, or team. LiteLLM and Portkey also support per-customer budget limits at the gateway layer. The key is instrumenting your calls with consistent metadata from day one — retrofitting attribution is painful.

Is Langfuse free?

Langfuse is open-source and free to self-host. The cloud version has a free tier (50,000 observations/month as of 2026) and paid plans starting at $59/month for higher volume. Self-hosting is a strong option for cost-conscious teams — you pay only for the infrastructure you run on, with no per-event pricing.

What is the difference between LiteLLM and Helicone?

Both are gateway/proxy tools that sit in front of your LLM calls, but they focus on different problems. LiteLLM is primarily a unified API layer — it normalizes 100+ models to a single OpenAI-compatible interface, handles fallbacks and load balancing, and adds budget controls. Helicone is primarily an observability gateway — it logs every call with detailed analytics, cost attribution, and prompt management, with lighter load-balancing features. For pure routing and budget enforcement, LiteLLM is the stronger choice. For analytics-first logging, Helicone.

Do I need both a gateway and an observability tool?

At serious scale, yes. A gateway enforces real-time budget limits and routes traffic before tokens are spent. An observability tool gives you the post-call trace detail needed to understand why costs spiked — which user, which feature, which prompt version. A gateway alone tells you what you spent; an observability tool tells you why. They're complementary: most production teams settle on one of each.

What is LangSmith used for?

LangSmith is LangChain's observability and evaluation platform. It provides trace-level visibility into LangChain and LangGraph chains and agents — showing every LLM call, tool call, cost, latency, and input/output in a timeline view. Cost tracking is included in all LangSmith plans. It's tightly integrated with the LangChain ecosystem, which makes it the lowest-friction choice for teams already using LangChain.

How does Datadog track LLM costs?

Datadog's LLM Observability module (GA as of 2025) captures token counts, cost estimates, latency, error rates, and evaluation metrics for LLM calls. It integrates via an SDK or auto-instrumentation for OpenAI, Anthropic, and other providers. Cost data flows into the same dashboards as your existing infra metrics, making it easy to correlate LLM spend spikes with engineering incidents. The trade-off is price: Datadog is the most expensive option at scale, and its LLM-specific features are shallower than dedicated tools.

When should I add a FinOps tool on top of LLM observability?

When your LLM spend exceeds roughly $50K/month and you need cross-cloud cost allocation, chargeback reporting for internal teams, or FinOps-standard tagging for board-level reporting. Below that threshold, the cost attribution from a gateway + observability stack is usually sufficient. Tools like Vantage, CloudZero, or OpenMeter operate at the billing/cloud layer — useful for enterprise finance teams but overkill for most engineering-led cost tracking.