Observability ≠ Intelligence: Why Your LLM Monitoring Tool Won’t Save You Money

You can see every token. You can trace every call. And you’re still overspending by 30–40%. Here’s why dashboards don’t equal decisions.

22 March 2026 · 9 min read

Your engineering team has done everything right. They’ve deployed an LLM observability platform. Every API call is traced. Every token is counted. Latency percentiles are on a Grafana board. Cost per model is in a spreadsheet somewhere.

And your AI spend is still growing 15–20% month-on-month with no clear explanation of where the money is going or what to do about it.

This is the observability trap. The entire LLM tooling market has converged on a single promise: visibility. And visibility is genuinely valuable. But somewhere along the way, the industry started treating “you can see what’s happening” as synonymous with “you can fix what’s happening.” They are not the same thing. Not even close.

The Dashboard Comfort Blanket

Here’s a scenario that plays out in every enterprise we talk to.

A platform team deploys an observability tool. Could be Langfuse, Portkey, Datadog’s LLM module, or a homegrown stack built on OpenTelemetry. Within a week, they have dashboards. Beautiful, real-time dashboards showing token counts, latency distributions, error rates, and cost breakdowns by model.

The CTO sees the dashboard and feels reassured. “We have visibility.” The CFO gets a monthly export and sees total spend. The engineering team has traces they can use to debug production issues.

Everyone is satisfied. Nobody is optimising.

Because the dashboard answers a very specific question: “What happened?” It does not answer the question that actually saves money: “What should we do differently?”

Knowing that you spent $14,200 on GPT-4o last month is information. Knowing that $5,400 of it went to classification tasks that a model costing 95% less could have handled identically is intelligence. And no observability dashboard on the market generates that second insight automatically.

The Three Camps (and Their Blind Spots)

The LLM operations tooling market has split into three categories, each solving a genuine problem and each stopping short of the one that matters most to the business.

Camp 1: Traditional APM platforms. Datadog, New Relic, and Dynatrace have all added LLM monitoring modules. They track tokens and latency alongside your existing infrastructure metrics. Excellent for correlating AI performance with system health. But they treat LLM calls the same way they treat database queries: something to monitor, not something to strategically route or optimise. They’ll tell you a call was slow. They won’t tell you it was expensive for no reason.

Camp 2: AI-native tracing tools. Langfuse, LangSmith, and Arize Phoenix go deeper on trace capture, prompt versioning, and evaluation workflows. They’re invaluable for debugging agent chains and measuring output quality. But their cost features are descriptive, not prescriptive. You get cost-per-trace. You don’t get “this trace should have been routed to a different model.”

Camp 3: AI gateways. Portkey and Helicone sit between your application and model providers, handling routing, caching, and failover. Observability comes built-in. But the primary job of a gateway is reliability and access: routing is based on uptime and fallback rules, not cost-per-task intelligence. The cost tracking tells you what you spent. It doesn’t tell you what you should have spent.

All three camps are useful. We’re not suggesting you rip any of them out. But none of them close the loop between seeing your spend and reducing it. That gap is where real money lives.

What the Gap Looks Like in Practice

In our first design partnership, we plugged into their existing observability stack and ran our cost intelligence layer on top.

They had full tracing. They had cost dashboards. They had a monthly report that went to finance.

Within the first analysis cycle, we identified 38% in addressable savings that their existing tooling had never surfaced. Not because the data wasn’t there, but because no tool was asking the right questions of it.

The savings fell into three categories their dashboards couldn’t detect:

Model-task mismatch. 62% of their API calls were hitting a frontier model for tasks that a model costing 10–15× less could handle at equivalent quality. Their observability tool showed the cost per call. It never flagged that the call didn’t need to be that expensive.

Context bloat at the prompt level. Average input token counts were 3.4× higher than necessary. Developers were passing full documents into context windows when targeted excerpts would produce identical outputs. The tracing tool recorded the token count. It never suggested the token count was wasteful.

Redundant processing across teams. Three business units were independently calling the same model for overlapping use cases with no shared prompt library and no caching strategy. Each team’s spend looked reasonable in isolation. In aggregate, the duplication was costing $4,800 per month.
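All three waste patterns can be flagged from data an observability export already contains. Here is a minimal sketch: the record schema, the `frontier-xl`/`small-v2` model names, the task capability list, and the 40:1 bloat threshold are all hypothetical illustrations, not values from the engagement above.

```python
from collections import defaultdict

# Hypothetical per-call trace records, as a tracing tool might export them.
calls = [
    {"team": "support", "task": "classify_ticket", "model": "frontier-xl",
     "input_tokens": 6200, "output_tokens": 40, "cost_usd": 0.31},
    {"team": "sales", "task": "classify_ticket", "model": "frontier-xl",
     "input_tokens": 5800, "output_tokens": 35, "cost_usd": 0.29},
    {"team": "ops", "task": "classify_ticket", "model": "small-v2",
     "input_tokens": 900, "output_tokens": 30, "cost_usd": 0.01},
]

# Tasks judged safe for a small model. A real system would derive this from
# quality evaluations rather than hard-coding it.
SMALL_MODEL_TASKS = {"classify_ticket"}

# Pattern 1: model-task mismatch.
mismatches = [c for c in calls
              if c["task"] in SMALL_MODEL_TASKS and c["model"] != "small-v2"]

# Pattern 2: context bloat, flagged by an extreme input:output token ratio.
bloated = [c for c in calls if c["input_tokens"] / c["output_tokens"] > 40]

# Pattern 3: the same task called independently by multiple teams.
teams_per_task = defaultdict(set)
for c in calls:
    teams_per_task[c["task"]].add(c["team"])
duplicated = {t: teams for t, teams in teams_per_task.items() if len(teams) > 1}

print(len(mismatches), len(bloated), sorted(duplicated))
```

None of these checks needs new instrumentation; they need someone to ask the questions of data that is already being collected.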

Their monitoring was perfect. Their visibility was 100%. Their waste was 38%. Observability without intelligence is an expensive illusion of control.

Observability vs. Intelligence: The Capability Gap

This isn’t a criticism of observability platforms. Portkey’s gateway is excellent infrastructure. Langfuse’s tracing is best-in-class for debugging. Datadog’s LLM module makes sense if you’re already in their ecosystem. Each solves the problem it was designed for.

The issue is category confusion. Teams assume that because they can see their LLM spend, they’re managing it. The tooling market has reinforced this assumption by positioning dashboards as the end state rather than the starting point.

Observability tools give you token counting, cost tracking, latency monitoring, and trace capture. What they don’t give you: model-task mismatch detection, routing recommendations per prompt type, context efficiency scoring, cross-team duplication analysis, savings quantification in dollar terms, or an actionable optimisation playbook. That’s the gap. And that gap is where the money is.

Why This Matters Now

Twelve months ago, most enterprises were running one or two models in production. The cost was high but predictable. A single engineer could eyeball the bill and spot anomalies.

That world is gone.

Today, the average enterprise AI deployment involves multiple model providers, multiple business units, autonomous agents making recursive calls, and a growing zoo of use cases from simple classification to complex multi-step reasoning. The surface area for waste has expanded by an order of magnitude.

In this environment, observability is table stakes. It’s the smoke detector. Necessary, non-negotiable, but not a fire suppression system. You need it, and then you need the layer that actually acts on what it finds.

That layer is cost intelligence: the ability to automatically classify every LLM interaction by task type, compare the cost of the model used against the cheapest model capable of equivalent output, quantify the savings opportunity in dollar terms, and surface those recommendations to the teams and leaders who can act on them.
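The core of that loop is small enough to sketch. Assuming a price table and a task-to-cheapest-capable-model map (both hypothetical here; real numbers would come from provider price sheets and your own quality evaluations), the per-workload savings opportunity is just the gap between what was spent and the cost floor:

```python
# Illustrative $/1K-token prices and capability map (assumed, not real quotes).
PRICE_PER_1K_TOKENS = {"frontier-xl": 0.030, "mid-v1": 0.003, "small-v2": 0.0006}
CAPABLE_MODELS = {  # cheapest model judged to produce equivalent output per task
    "classification": "small-v2",
    "summarisation": "mid-v1",
    "multi_step_reasoning": "frontier-xl",
}

def savings_opportunity(task, model_used, total_tokens):
    """Dollar gap between actual spend and the cheapest capable model."""
    spent = PRICE_PER_1K_TOKENS[model_used] * total_tokens / 1000
    best = CAPABLE_MODELS[task]
    floor = PRICE_PER_1K_TOKENS[best] * total_tokens / 1000
    return best, round(spent - floor, 2)

# A classification workload run on the frontier model:
best, saved = savings_opportunity("classification", "frontier-xl", 2_000_000)
print(best, saved)  # recommends small-v2: $60.00 spent vs a $1.20 floor
```

The hard part is not this arithmetic; it is building the capability map, which requires evaluating output quality per task type rather than trusting defaults.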

The Compounding Cost of Inaction

Here’s what makes this urgent. LLM costs don’t stay flat. They compound.

Every new use case that goes into production inherits the defaults of the ones before it. If your first chatbot used GPT-5.4 because nobody evaluated alternatives, your second chatbot will too. And your third. By the time you have fifteen use cases in production, the model-task mismatch isn’t 38%, it’s embedded in your architecture.

The organisations that build cost intelligence into their AI platform now will compound savings. The ones that wait until the CFO demands answers will spend six months retrofitting what should have been a foundational layer.

We’ve seen this pattern before. It happened with cloud computing. The companies that adopted FinOps early saved millions. The ones that didn’t spent two years cleaning up sprawl after the fact. AI cost intelligence is the FinOps moment for LLMs. The window to get ahead of it is now.

What to Do on Monday Morning

If you have an observability tool deployed, you’re not starting from zero. You have the data. The question is whether anyone is turning that data into decisions.

Step 1: Audit your model-task alignment. Pull a sample of 500 API calls from last week. For each one, ask: did this task require a frontier model? If a model 10× cheaper could produce the same output, flag it. If more than 40% of your calls are flagged, you have a mismatch problem that dashboards will never surface on their own.
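Mechanically, Step 1 reduces to a flag rate over the sample. A minimal sketch, with made-up audit results (in practice a human reviewer or an evaluation harness decides, per call, whether the cheaper model's output would have been equivalent):

```python
# Hypothetical audit of 500 sampled calls: 230 were judged handleable by a
# model at least 10x cheaper, 270 genuinely needed the frontier model.
audit = (["flag"] * 230) + (["ok"] * 270)

flag_rate = audit.count("flag") / len(audit)
has_mismatch_problem = flag_rate > 0.40  # the 40% threshold from Step 1

print(f"{flag_rate:.0%} of calls flagged; mismatch problem: {has_mismatch_problem}")
```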

Step 2: Measure context efficiency. Look at your average input token count relative to the output. If you’re consistently sending 8,000 input tokens to get 200 output tokens, your prompts are carrying dead weight.
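Step 2 is a ratio over data any tracing tool already records. A sketch with invented token counts; the 40:1 review threshold is an assumption to adjust for your own task mix, since summarisation legitimately runs high-input, low-output:

```python
# Hypothetical per-call (input_tokens, output_tokens) pairs from a trace export.
calls = [(8000, 180), (9000, 150), (7600, 200)]

total_in = sum(i for i, _ in calls)
total_out = sum(o for _, o in calls)
ratio = total_in / total_out

# 40:1 is an assumed "dead weight" heuristic, not a universal threshold.
print(f"input:output ratio = {ratio:.1f}:1, review prompts: {ratio > 40}")
```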

Step 3: Map spend to teams and use cases. Most observability tools can attribute cost to an API key or a project tag. Use that. If you find three teams independently spending $1,500/month on overlapping use cases, consolidation alone will save you $3,000/month before you touch a single prompt.
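The consolidation arithmetic in Step 3 can be sketched directly from attributed spend rows. The team names and figures below are hypothetical, and the savings estimate assumes consolidating N teams onto one shared service keeps roughly one team's spend:

```python
from collections import defaultdict

# Hypothetical monthly spend rows attributed via API key or project tag.
rows = [
    ("support", "ticket_triage", 1500.0),
    ("sales",   "ticket_triage", 1500.0),
    ("ops",     "ticket_triage", 1500.0),
    ("growth",  "content_draft",  900.0),
]

spend_by_use_case = defaultdict(float)
teams_by_use_case = defaultdict(set)
for team, use_case, usd in rows:
    spend_by_use_case[use_case] += usd
    teams_by_use_case[use_case].add(team)

for use_case, teams in teams_by_use_case.items():
    if len(teams) > 1:
        # Simplifying assumption: one team's spend remains, the rest is saved.
        per_team = spend_by_use_case[use_case] / len(teams)
        savings = per_team * (len(teams) - 1)
        print(f"{use_case}: {len(teams)} teams, ~${savings:,.0f}/month savings")
```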

Step 4: Quantify the gap. Take your total monthly LLM spend. Multiply it by 0.35. That’s a conservative estimate of the savings sitting in your existing architecture, invisible to every dashboard you have running today.
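Step 4 in full, with a hypothetical bill plugged in:

```python
monthly_llm_spend = 40_000  # hypothetical total monthly LLM bill, USD

# 0.35 is the conservative multiplier from Step 4.
addressable_savings = monthly_llm_spend * 0.35
annualised = addressable_savings * 12

print(f"~${addressable_savings:,.0f}/month (~${annualised:,.0f}/year) addressable")
```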

Then ask yourself: is that number large enough to justify building a proper cost intelligence layer?

For every enterprise we’ve spoken to, the answer has been yes.

Want to benchmark your organisation's AI adoption?

PromptLeash can calculate your AIM Score and show you exactly where AI adoption is thriving and where it's stalling.
