Cost Management for AI Development

Token costs compound silently until your end-of-month invoice arrives. The decision you made on day one to dump the whole codebase into the prompt, the loop you wrote that retries five times on parse failure, the chat assistant you shipped that does not cache its system message, all of those choices keep paying out for as long as the feature is in production. By the time the bill is large enough to notice, the architectural choices that produced it are already calcified, and the question is not whether to optimize but how much rework the optimization will cost.

This page is about catching that compounding before it happens. Not as a separate optimization phase that gets scheduled after launch, but as the silent layer of every AI feature you ship. Cost management for AI development is not a finance discipline; it is an engineering one. The teams that treat it as such ship sustainably. The teams that treat it as somebody else's problem discover, three months in, that their unit economics never made sense.

~10x: Cost reduction available on cached portions of a Claude prompt versus full input rates, the largest single cost lever in agent workflows
~5x: Typical ratio of output token cost to input token cost across major providers, which makes verbose responses unusually expensive
$1,000: Plausible damage from a single runaway agent loop running unchecked for a few hours on a high-tier model with no spend limit
~60%: Share of typical agent-style spend that is input tokens, with output and overhead splitting the remainder

Where the Costs Actually Go

The first move in cost discipline is naming what you are paying for. Most teams cannot do this with any specificity, which means they cannot fix anything either. The bill arrives, somebody flags it, somebody else asks "where is it coming from?", and the answer is a vague gesture toward "the AI features." The vague gesture is the actual problem. The components of the bill are knowable, and naming them is the start of controlling them.

Input tokens are usually the largest line. Everything the model sees counts as input: the system message, the user message, any reference docs, any few-shot examples, any tool descriptions, any prior conversation turns. Modern providers price input by the million tokens. The number sounds small until you realize that a single agent session on a real codebase routinely sends 50K to 200K input tokens per turn, and a turn happens every time the agent does something. Multiply by sessions per day, days per month, users per product, and the input bill is the load-bearing piece of the total.

Output tokens are smaller in volume but disproportionately expensive. Most providers price output at roughly 4x to 5x the input rate. Claude Sonnet, GPT-5, and Gemini Pro all sit in that ratio. The rationale is generation cost: each output token requires its own serial forward pass through the model, while input tokens are processed in parallel during prefill. The practical implication is that a verbose response is not just slow but actively expensive. A model that returns 5K tokens of explanation when 500 would have done the job costs you 10x more on the output side, and that overhead is invisible if you are not watching it.

Sub-agent dispatches are their own line item. When the parent agent spawns a child agent for a subtask, the child gets its own context window, its own system prompt, its own input tokens to pay for. None of the parent's context is inherited by default; the child starts fresh. That is good for context hygiene and bad for naive cost estimates. The math: a parent at 50K context plus 10 sub-agents each at 30K context equals 350K input tokens for a single composite task. If you assumed the sub-agents would amortize the parent's context, you assumed wrong.

Retries quietly double the bill. Network failure, validation failure, parser failure, content-policy refusal, any reason the call has to be re-sent means you paid the original input cost and then paid it again. Most retry policies are silent: the call fails, the wrapper retries, the second call succeeds, the user sees a normal response, and the bill shows two charges instead of one. A 5% retry rate, which is mild, is a 5% cost overhead on every workflow that uses retries.
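To make the overhead visible, a wrapper can count the tokens each failed attempt re-sent. A minimal sketch, assuming a generic call_model function and a rough four-characters-per-token heuristic; both are stand-ins, not provider APIs:

# Hedged sketch: a retry wrapper that surfaces re-sent input cost.
import time

def call_with_retries(call_model, prompt, max_attempts=3):
    retried_input_tokens = 0
    for attempt in range(1, max_attempts + 1):
        try:
            # On success, report how much input the retries re-paid for.
            return call_model(prompt), retried_input_tokens
        except Exception:
            # The failed attempt already billed the full input cost.
            retried_input_tokens += len(prompt) // 4  # ~4 chars per token
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # back off, then pay the input again

Logging retried_input_tokens per workflow turns the silent 5% overhead into a number on a dashboard.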

Long-running agent loops are where the compounding gets ugly. Each turn adds context. The added context is paid for on every subsequent turn. By turn fifteen, the context might be 80K tokens; the agent is paying full input rate on those 80K tokens for every remaining turn. The cost per turn climbs the longer the loop runs. The sessions that feel productive in the moment, "we are making progress, just keep going," are exactly the sessions where the per-turn cost is highest. Length is not free.
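The compounding is easy to see in a toy model. Assume the context grows by a fixed amount each turn; the total input billed over the session is then quadratic in session length, not linear. The starting size and growth rate below are illustrative assumptions:

# Toy model: per-session input cost when context grows every turn.
START_CONTEXT = 10_000         # tokens in context at turn one (assumed)
GROWTH_PER_TURN = 5_000        # tokens each turn adds (assumed)
RATE_INPUT = 3.00 / 1_000_000  # $ per input token, Sonnet-tier, uncached

def session_input_cost(turns):
    total = sum(START_CONTEXT + GROWTH_PER_TURN * t for t in range(turns))
    return total * RATE_INPUT

print(session_input_cost(5))   # ~$0.30
print(session_input_cost(15))  # ~$2.03 -- 3x the turns, ~7x the cost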

Then there is the overhead nobody wants to count. Embedding generation for retrieval. Vision input on multimodal calls. The token count of tool definitions in the system prompt. The token count of structured output schemas. Image generation. Audio transcription. None of these are large individually. Together, on a feature that uses several of them, they often account for 10% to 20% of the total bill, which is the difference between healthy unit economics and a feature that loses money on every user.

Typical breakdown of agent workload spend:
input tokens: ~60%
output tokens: ~25%
sub-agent dispatch overhead: ~7%
retries and recoveries: ~5%
tool definitions, schemas, embeddings: ~3%

The proportions above are a rough average across well-behaved agent workloads. The exact numbers vary by provider, by feature shape, by how disciplined the team is about caching and context management. The instructive thing is the dominance of input. If you only optimize one line, optimize input volume. The lever is large and the others are smaller. That single observation, taken seriously, is responsible for most of the realized savings in the field.

One more sub-component deserves a callout because it is so often miscounted: the cost of evaluation. Teams running serious AI features instrument their outputs with judge models, regression suites, and automated quality checks. The judge calls cost real tokens. A regression suite that runs ten test prompts on every release with a judge model evaluating the outputs adds up to something on the order of $5 to $50 per run, depending on the model. Run that suite on every PR and the cost is not catastrophic but it is not zero either. Plan for it as a separate budget line, not as something hidden inside the feature's normal spend.

Model Selection by Tier

The single highest-leverage cost decision in any AI workload is which model handles which slice of the work. Most teams default to "the best model" because the best model produces the best results, which is true and also expensive. The discipline is to ask, for each call, whether the marginal quality improvement of the higher tier is worth the multiplier in cost. The honest answer is that for most calls in most workloads, a smaller model would have done the job.

The Claude family is the cleanest place to learn this discipline because the tiers are well-differentiated and the pricing reflects the differentiation. Three tiers matter.

Claude Haiku sits at the cheap end. The pricing as of 2026 is roughly $0.25 per million input tokens and $1.25 per million output tokens, give or take revisions. The model is fast, surprisingly capable on routine tasks, and the right choice for high-volume work where the per-call quality bar is moderate. Classification, simple extraction, low-stakes summarization, content moderation, the routing layer that decides which downstream model to call: all of these are Haiku territory. The gap to Sonnet is real, but on the right tasks, the gap is smaller than the price difference would suggest.

Claude Sonnet is the workhorse. The pricing sits roughly in the middle of the family, currently around $3 per million input and $15 per million output. Sonnet handles the tasks where you need genuine reasoning quality but the call is high-volume enough that the price ceiling matters. Most production agent work belongs here. Most chat-style features belong here. Most coding assistance, code review, documentation generation, structured analysis at moderate complexity, all default to Sonnet. The model is good enough to ship and cheap enough to scale.

Claude Opus is the high-stakes tier. The pricing is roughly $15 per million input and $75 per million output, the most expensive of the three. Opus earns the price tag on the tasks where the cost of being wrong dwarfs the cost of the call: architectural decisions, deep code review, hard reasoning problems, multi-step planning where each step has real consequences. Use Opus when the alternative is a human spending an hour on the task; the call is cheap by comparison. Do not use Opus for routine work; the marginal quality is small and the cost difference is large.

Claude Haiku, ~$0.25 input / $1.25 output per 1M tokens

Fast classification, simple extraction, content moderation, routing decisions, format conversion, high-volume low-stakes calls. The model is small enough that it sometimes flubs nuanced reasoning, which is fine when the task does not require nuanced reasoning. Use Haiku as the default for any feature where you measure throughput in calls per second. The cost ceiling lets the feature scale.

Claude Sonnet, ~$3 input / $15 output per 1M tokens

Production agent work, chat features, coding assistance, structured analysis, document processing, RAG-style synthesis. Sonnet is good enough for almost any task that does not require the absolute frontier of reasoning, and cheap enough to ship at scale. Most of your spend should land here. The price gap to Opus is real and almost always worth respecting.

Claude Opus, ~$15 input / $75 output per 1M tokens

Architectural reasoning, hard refactors, multi-step planning, deep code review, anything where being wrong costs more than the call. Opus is expensive on purpose; use it where the alternative is human time. A single Opus call that saves an hour of engineering is the cheapest line item in any technical workload. Calling Opus from a high-volume loop is the most expensive mistake you can make.

The honest comparison to other providers matters because no one runs Anthropic exclusively. GPT-5 from OpenAI offers a similar tiering, with GPT-5 mini and GPT-5 nano covering the cheap end and full GPT-5 covering the high end. The pricing is competitive and shifts month to month; as of late 2026, the headline numbers are in the same general range as Claude's, with some inversions on specific tiers. Gemini 2.5 Pro and Gemini 2.5 Flash from Google are aggressively priced for the long-context use cases, often beating both Claude and GPT on the per-token cost when the prompt is large.

The honest take: Claude wins on agent quality and prompt caching mechanics. GPT wins on raw cost at high volume for some workloads, especially when batch API access is available. Gemini wins on long-context cost. The right answer for any given feature is rarely "use one provider for everything." The right answer is to slice the workload by tier and pick the model that matches each slice.

The "smallest model that gets the right answer" rule is the operative principle. For each call, start by asking what the smallest model that produces acceptable output would be. Test it. If it works, ship it. If it does not, step up one tier and test again. Most teams skip this step entirely and default to the highest tier, which is the most common form of unnecessary spend in the field. The discipline takes one afternoon per feature and pays back every month of operation.

One nuance that traps teams: the cheapest model is not always the lowest-cost outcome. A Haiku call that requires three retries to produce acceptable output costs more than a single Sonnet call that succeeds the first time, both in tokens and in latency. The right comparison is not per-call cost but cost-per-acceptable-output, which is per-call cost divided by acceptance rate. A 95% acceptance rate at Haiku tier might still beat a 99.5% acceptance rate at Sonnet tier; a 70% acceptance rate at Haiku tier almost certainly does not.
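The comparison is one line of arithmetic once the acceptance rates are measured. The per-call costs and rates below are illustrative:

# Cost-per-acceptable-output: expected attempts are 1 / acceptance_rate.
def cost_per_accepted(cost_per_call, acceptance_rate):
    return cost_per_call / acceptance_rate

print(cost_per_accepted(0.002, 0.95))   # Haiku-tier call: ~$0.0021
print(cost_per_accepted(0.013, 0.995))  # Sonnet-tier call: ~$0.0131
# Each rejection also costs a validation pass and latency, which is
# why low acceptance rates lose even when the raw token math does not.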

Prompt Caching Mechanics

Prompt caching is the single largest cost lever available in modern AI development. The idea is simple: when you send a prompt that contains content you have sent before, the provider stores the processed state of that content for a short time, and on subsequent calls, you pay a fraction of the normal input cost for the cached portion. The fraction is roughly 10% on Anthropic's default tier, which is to say a 90% cost reduction on the cached content. Used correctly, prompt caching turns expensive workloads into affordable ones.

Anthropic's implementation is the cleanest to learn from. You mark cache breakpoints in your prompt structure: a stable system message, a stable set of reference documents, the agent's instructions, anything that does not change between calls. On the first call, the provider charges the full input rate but stores the processed cache. On subsequent calls within the cache TTL, the cached content is read at 10% of the input rate. The TTL is 5 minutes on the default tier and longer on premium tiers, which is the right shape for chat-style workloads, where the next user message typically arrives well within the window.

The pattern that captures the maximum value: structure your prompt so the stable parts come first and the volatile parts come last. The cache works on prefix matching; the largest stable prefix is the largest cached portion. If your system prompt is 5K tokens and your user message is 500 tokens, putting the system prompt first means the 5K tokens get cached on subsequent calls. Putting the user message first means none of it does. The order matters more than most teams realize the first time they implement caching.

# Claude API call structured for maximum prompt cache hits
# (Python SDK, Anthropic; same shape across other SDKs)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_INSTRUCTIONS,  # stable across all calls
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": REFERENCE_DOCS,  # stable; large; changes rarely
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        # Conversation history goes here, with the volatile
        # user input as the last message. The cache reads
        # everything up to the last cache_control breakpoint
        # and re-processes only what comes after.
        {"role": "user", "content": user_input},
    ],
)

# Subsequent calls within the 5-minute window pay roughly
# 10% of the input rate on the cached system blocks. The
# user_input is processed at full rate, which is fine
# because user_input is small.

The mechanics matter, but the workload patterns are where the savings come from. Three patterns capture most of the value.

The chat pattern. A user is having a conversation with the assistant. The system prompt is fixed. The conversation history grows. Every turn, you cache the system prompt plus the prior turns and process only the new user message at full rate. The longer the conversation, the larger the cached portion, the larger the savings. By turn ten in a long-context chat session, more than 90% of the input tokens are cached, which means more than 90% of the input cost has been cut. The same shape applies to agent sessions, where the system prompt and the project context are fixed and the agent's actions accumulate as conversation turns.

The reference docs pattern. A feature that uses retrieval-augmented generation pulls back a set of documents and includes them in the prompt. If the document set changes per query, you get no caching. If the document set is stable across many queries (top-K from a large index of slowly-changing documents), you can cache the documents and process only the question at full rate. The pattern requires some retrieval-system discipline (deterministic ordering of retrieved documents, stable formatting, cache-friendly chunking) but the payoff is large for any high-volume question-answering workload.

The instruction-heavy pattern. Some features include long instructions: detailed style guides, complex tool descriptions, multi-paragraph rubrics for evaluation tasks. The instructions are stable. The inputs are not. Cache the instructions; process the inputs at full rate. This is the workload shape where teams most often forget caching exists, because the per-call instruction overhead feels invisible until you tally it across a million calls.

Cost reduction on the cached portion versus full input rate: ~90%
Typical cache hit rate on a well-structured chat workload after the first turn: ~85%
Total cost reduction on chat-style workloads after caching is added: ~70%
Cost reduction on a RAG workload with stable document retrievals: ~50%

The numbers above are achievable, not guaranteed. The teams that hit them have done two things: structured their prompts to maximize the cacheable prefix, and instrumented their cache hit rate so they can verify it is working. The teams that talk about caching but never measure are typically getting half the value they could be getting, because something in their prompt structure is invalidating the cache more often than they realize.

One trap to know about. The cache is invalidated by any change to the cached prefix. If your system prompt includes a timestamp, a session ID, a user ID, or anything that varies per call, the cache will not hit. The fix is structural: keep the variable content in a separate, non-cached block, and let the stable content sit in its own cached block. The prefix-matching nature of the cache means a single byte difference in the wrong place can invalidate everything that follows it.
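The structural fix, sketched against the Anthropic-style system block array from the earlier example. The SYSTEM_INSTRUCTIONS text and the timestamp are illustrative stand-ins for whatever stable and volatile content your prompt carries:

# The fix for per-call variance: keep volatile values out of the
# cached block. "now" is any per-call value (timestamp, session ID,
# user ID); SYSTEM_INSTRUCTIONS is the stable text from earlier.
from datetime import datetime, timezone

SYSTEM_INSTRUCTIONS = "You are a careful assistant..."  # stable text
now = datetime.now(timezone.utc).isoformat()

# Bad: the timestamp sits inside the cached prefix, so the prefix
# changes every call and the cache never hits.
system_bad = [
    {
        "type": "text",
        "text": f"Current time: {now}\n\n{SYSTEM_INSTRUCTIONS}",
        "cache_control": {"type": "ephemeral"},
    },
]

# Good: the stable block caches; the volatile block sits after the
# breakpoint and is re-processed at full rate, which is fine because
# it is a handful of tokens.
system_good = [
    {
        "type": "text",
        "text": SYSTEM_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral"},
    },
    {
        "type": "text",
        "text": f"Current time: {now}",
    },
]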

OpenAI also offers prompt caching on GPT-5 and adjacent models, with similar mechanics and similar discounts. Gemini has its own context-caching system with a different pricing model (a separate per-hour cache fee, plus reduced input cost on cached content). The provider-specific details vary; the principle is the same. If your workload has any stable prefix, cache it. The cost difference is the difference between an affordable feature and one that has to be re-architected when it scales.

Context Window Economics

Just because you have 200K tokens of context window does not mean you should use them. The fact that the window exists is a license to fit more relevant content; it is not a mandate to fill the space. Most context-related cost overruns come from a misreading of the affordance: the team treats the window as a buffer to fill rather than a budget to spend.

The relevant cost is per-token. Every token in the prompt costs money on every call. A prompt that uses 50K of the available 200K is one quarter the input cost of a prompt that uses 200K. The model does not care whether the unused space is empty; it does not give you a discount for restraint. Restraint is the discount, paid in tokens not sent.

Sliding context is the simplest discipline that makes a difference. As a session progresses, throw away history that has stopped being relevant. The user asked about the search bug ten turns ago, the bug was fixed, the conversation moved on. The original prompt and the original investigation are now noise. Modern agent harnesses can compact older turns into summaries, or simply drop them. The implementation differs by harness; the discipline is the same. Treat conversation history as a sliding window, not an accumulating one.

Summary compression is the more aggressive version of sliding. Instead of dropping older turns, ask the agent to summarize them before continuing. A consolidation prompt looks like this: "Summarize what we have done so far, what files have changed, and what is currently in flight. Be detailed enough that we can continue from your summary." The summary is paid for once; the compressed history then takes 10% of the space the original turns took, on every subsequent turn. Net cost: a small one-time output cost, ongoing input cost reduction for the rest of the session.
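A minimal sketch of the compaction step, reusing the client shape from the caching example above. The prompt wording, the 2,048-token budget, and the keep-the-last-exchange choice are all assumptions to tune per harness:

# Hedged sketch: one-time summarization call that replaces old turns.
COMPACTION_PROMPT = (
    "Summarize what we have done so far, what files have changed, and "
    "what is currently in flight. Be detailed enough that we can "
    "continue from your summary."
)

def compact_history(client, history):
    summary = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=history + [{"role": "user", "content": COMPACTION_PROMPT}],
    ).content[0].text
    # The summary replaces the old turns; keep the last exchange so the
    # model retains its immediate working state.
    return [
        {"role": "user", "content": f"Session summary so far:\n{summary}"},
        {"role": "assistant", "content": "Understood. Continuing."},
    ] + history[-2:]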

Session starts: 5K cached + 0K user
Turn 5: 5K cached + 30K accumulated
Turn 10: 5K cached + 80K accumulated
Compaction: 5K cached + 8K summary
Continue: each new turn pays on 13K, not 80K

The flow above is a generic shape; the actual numbers depend on how verbose the work is. The structural point is that a deliberate compaction at the right moment turns a degrading session into a fresh one without losing the work. The cost profile changes from "rising per turn" to "flat per turn." Over a long session, that is the difference between an affordable workflow and one that triples in cost over the last few turns.

Sub-agent isolation is the third lever, and it has its own section below because the tradeoffs are subtle. The summary version: dispatch a child agent with a fresh context for any work that would otherwise bloat the parent's window. The child explores, the child summarizes, the parent receives only the summary. The exploration cost is paid once in the child's window and never accumulates in the parent's. For exploration-heavy tasks, this is the largest single context-cost lever available, larger than caching for the right workload shape.

One mental shift that helps: think of the context window as a resource with a market price, not as a feature you have access to. The 200K tokens are not free. They are billed every time you use them. A session that holds 100K of context for 20 turns has paid for 2M input tokens of context alone, even if the user never typed more than a few hundred tokens of fresh content. The bigger the window, the more expensive the failure mode of "fill it because you can." Use what you need. Trim what you do not. Pay for signal, not for buffer.

The corollary applies at the design stage too. When you are deciding how a feature will work, model the context cost explicitly. If the feature will routinely send 80K tokens of context per call, the per-call cost is going to be roughly four times what it would be at 20K. Both numbers are technically affordable; one of them is sustainable and the other is not, depending on the call volume. The choice of how much context to send per call is one of the most consequential design decisions you will make for any AI feature, and it is rarely treated with the seriousness it deserves.

Sub-Agent Dispatch Costs

Sub-agent dispatch is the second-largest cost lever in agent workflows, behind prompt caching. The mechanics are simple: instead of doing exploration-heavy work in the main session, the agent spawns a child with a focused task and a limited budget, the child does the work, the child returns a summary, and the parent receives only the summary. The exploration tokens never accumulate in the parent's context.

The cost math has two sides, and ignoring either side gets you the wrong answer. On one side, the sub-agent does its own work in its own context window. That is real cost: the child has a system prompt, the child reads files, the child reasons, the child generates output. None of those tokens are free. On the other side, the sub-agent's work does not pollute the parent's context. The parent does not pay for those tokens on subsequent turns. For any task where the alternative is "this exploration sits in the parent's window for the next twenty turns," the sub-agent saves the exploration cost times twenty, minus the one-time cost of spinning up the child.

The breakeven calculation. If the exploration would have produced 30K tokens of context in the parent's window, and the parent has 15 more turns to run, the parent would pay for those 30K tokens on every one of the 15 turns. That is 450K input tokens over the rest of the session, paid at full input rate (or 10% if cached, but cached only after first hit). The same exploration in a sub-agent costs roughly the 30K tokens the child consumes, plus the small cost of the summary the child returns to the parent. Net savings are large for any non-trivial session length.
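The same arithmetic in code, using the figures from the paragraph above; all three inputs are illustrative:

# Breakeven math for sub-agent dispatch.
EXPLORATION_TOKENS = 30_000  # context the exploration would otherwise add
SUMMARY_TOKENS = 500         # what the child returns to the parent
REMAINING_TURNS = 15

direct = EXPLORATION_TOKENS * REMAINING_TURNS                       # 450K
dispatched = EXPLORATION_TOKENS + SUMMARY_TOKENS * REMAINING_TURNS  # 37.5K
print(direct / dispatched)  # 12.0 -- the child's own system prompt
                            # overhead shaves this a little in practice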

Direct exploration in parent

Parent reads 20 files searching for instances of a pattern. The 30K tokens of file content sit in parent's context for the rest of the session. Over the next 15 turns, that 30K is paid at the input rate every turn, totaling 450K input tokens of stale exploration. The parent's context is also closer to its limit, which means the model's attention is more diffuse and the output quality drops over the rest of the session.

Sub-agent exploration

Parent dispatches a child agent with a focused task. Child reads the same 20 files, builds the answer, returns a 500-token summary. Parent's context grows by 500 tokens, not 30K. The child's exploration cost is paid once. Over the same 15 remaining turns, the parent pays for 500 tokens, not 30K. Total spend: child's one-time exploration cost (~30K) plus 500 tokens times 15 turns (~7.5K) versus the direct cost of 450K. Roughly a 12x reduction, a little less once the child's own startup overhead is counted.

The pattern works because the asymmetry is large. Exploration is high-volume input. The answer is low-volume output. The sub-agent absorbs the input volume; the parent receives only the output. For any task where this asymmetry holds, sub-agents are the closest thing to a free lunch in cost engineering. For tasks where the asymmetry does not hold, sub-agents are pure overhead.

When sub-agents are worth it. Code search across many files. Reading a large reference document to extract a few facts. Investigating a bug that requires reading the call graph. Cross-referencing implementations across a dozen modules. Anything where the cognitive payload is "go find this out" and the deliverable is a paragraph or less. Anything where the parent's context budget is tight and the exploration would push it past comfortable utilization.

When sub-agents are not worth it. Small tasks the parent could do in one or two file reads. Tasks where the conversation with the user is the substance of the work and dispatching a child loses the conversation. Tasks where the parent already has the relevant context and the child would need it sent over again, doubling the cost. Tasks where the latency of dispatching a child outweighs the cost savings, especially in interactive workflows where the user is waiting.

The default rule that works for most agent harnesses: dispatch a sub-agent for any exploration that would produce more than a few thousand tokens of file content or tool output in the parent's context. Below that threshold, the savings are too small to justify the extra round trip. Above it, the savings compound for the rest of the session.

One trap to watch. Some workflows use sub-agents recursively. The parent dispatches a child, the child dispatches its own child, and so on. Each level pays its own startup cost and contributes its own context consumption. Two or three levels deep is a sensible limit; deeper than that, the dispatch overhead starts to dominate. If you find yourself building tree-shaped agent compositions with high branching factors, instrument the per-level cost and check that the tree is not paying more in dispatch than it saves in context isolation.

Pricing Comparison Across Providers

No serious team runs only one provider in 2026. The pricing landscape rewards picking the right model for each slice of work, and the right model is rarely all from the same vendor. A working comparison of the major options, with honest takes on where each one earns its place.

Anthropic's Claude family was covered in the model selection section. The headline numbers: Haiku at roughly $0.25 / $1.25 per million input/output, Sonnet at roughly $3 / $15, Opus at roughly $15 / $75. Prompt caching is mature and discounts cached input by 90%. The agent ergonomics are best-in-class: Claude Code is the reference agent harness, the API is well-shaped for tool-use loops, the models handle multi-step reasoning more reliably than most. The trade-off is that pure cost-per-token is not the cheapest at any tier; you pay a small premium for the agent quality. For agent-heavy workloads, the premium pays for itself in fewer retries and shorter sessions. For high-volume non-agent workloads, you can sometimes do better elsewhere.

OpenAI's GPT-5 family is the obvious comparison. The pricing as of late 2026 sits in a similar range across tiers, with GPT-5 nano covering the cheap end (similar to Haiku), GPT-5 mini covering the middle (similar to Sonnet), and GPT-5 covering the high end (competitive with Opus, sometimes cheaper). OpenAI offers a Batch API that discounts async work by 50%, which is a substantial advantage on workloads that can tolerate delayed responses (overnight processing, bulk re-classification, evaluation runs). Caching is available with similar mechanics to Anthropic's, with smaller discounts on the default tier. The model family is strong on raw reasoning; the agent ergonomics have caught up considerably with Codex CLI, but Claude Code still leads on agent quality in most benchmarks worth taking seriously.

Google's Gemini 2.5 family is competitively priced, especially on long-context. Gemini 2.5 Flash sits in Haiku territory on price; Gemini 2.5 Pro sits between Sonnet and Opus on price with a different quality profile. Gemini's signature advantage is the 2M-token context window combined with aggressive long-context pricing, which makes it the right pick for any feature that genuinely needs to fit a million tokens of content into a single call. Multimodal capabilities are mature. Caching is available with a different pricing structure (a per-hour cache fee plus reduced input cost). For workloads dominated by huge prompts, Gemini often wins on absolute cost.

OpenRouter is the aggregator that matters. It is not a model provider; it is a single API that routes to many providers. The value: one API key, one billing account, the ability to swap models without rewriting your integration, and pricing that is typically a small markup over the underlying provider's direct rates. For teams that want to compare providers without committing to one, or that want fallback routing when a specific provider is degraded, OpenRouter is the easiest entry point. The downside is that some provider features (like Anthropic's caching breakpoints, or OpenAI's Batch API) are not always exposed cleanly through the aggregator. Use OpenRouter for flexibility; use direct provider APIs when you need provider-specific features.

Claude Haiku: $0.25 per 1M input tokens, $1.25 per 1M output tokens
Claude Sonnet: $3 per 1M input tokens, $15 per 1M output tokens
Claude Opus: $15 per 1M input tokens, $75 per 1M output tokens

The relative scale is more durable than the specific numbers. The exact prices will be slightly different by the time you read this. The structural truth is that adjacent tiers within a family are separated by roughly a 5x to 12x multiplier, while the gaps between providers at the same tier are typically a few percent to a few tens of percent, not multiples. Optimizing within a tier (provider A versus provider B at the middle tier) gets you small percentages. Optimizing across tiers (using cheap tier for the right work, mid tier for the rest) gets you multiples. Spend the optimization budget on tier-matching first, provider-shopping second.

The honest bottom line on which provider to pick. Claude wins on agent quality and prompt caching; default to Claude for any workflow where the agent loop is the main shape. GPT wins on raw cost at high volume when the Batch API is available; pick GPT for batch processing, evaluation runs, and bulk re-classification. Gemini wins on long-context economics; pick Gemini when the prompt is genuinely millions of tokens. OpenRouter wins on flexibility; pick it when the workload changes shape often or when fallback routing matters more than per-call optimization. Most serious products end up running multiple providers, which is fine; pick by workload shape, not by allegiance.

Monitoring and Alerting

You cannot manage what you do not measure. The first move in any AI cost discipline is making the spend visible, and the second move is making it actionable. Most teams skip the second move, which means they have a dashboard that shows them the bad news after the bad news is already too large to ignore.

Provider dashboards are the baseline. The Anthropic Console shows real-time spend by API key, by model, and by date range. The OpenAI Platform offers similar visibility, with usage breakdowns by project and by model. Google AI Studio and Google Cloud Console expose Gemini usage at varying granularities. OpenRouter has its own dashboard for spend across underlying providers. Every team running AI workloads should have at least one team member checking the relevant dashboards weekly, ideally daily during the first month of any new feature's deployment.

The dashboards are necessary but not sufficient. They tell you what happened, after it happened. Hard limits and alerts are what prevent the next surprise.

Set hard spend limits at the API key level. Anthropic supports per-key budget caps. OpenAI offers project-level spend limits that hard-cap the bill. Configure them. The right ceiling is some multiple of expected spend, large enough that legitimate spikes do not trigger it, small enough that a runaway loop hits the cap before the overrun becomes meaningful damage. A reasonable default: 3x expected monthly spend per key. If a feature normally bills $500 a month, set the cap at $1,500. If something goes wrong, the bleeding stops at $1,500 instead of $15,000.

Alerts at 50%, 80%, and 100% of budget are the standard pattern. The 50% alert is informational ("we are tracking on the high side this month"). The 80% alert is a warning ("something needs attention"). The 100% alert is a crisis ("the budget is gone, calls are about to start failing"). Wire these into the team's existing notification channel. The alerts are uninteresting most of the time, which is the point; they only fire when something has gone wrong, and when they do, you find out before the customer does.

The runaway loop detector deserves its own callout. An agent that gets into a loop, repeatedly calling itself or repeatedly retrying a failed step, can spend $1,000 in a few hours on a high-tier model with no spend cap. The pattern is rare but not theoretical; it has happened to enough teams that the pattern has a name. The defenses are mechanical: hard turn limits per session, hard cost limits per session, automatic termination when an agent has been running longer than expected, monitoring for unusual call rates from a single key. Build at least one of these into any agent harness running on your infrastructure. The cost of the safety net is negligible. The cost of not having it is sometimes measured in thousands of dollars per incident.

1. Set per-key spend caps

In your provider's dashboard, configure a hard ceiling on every API key. Use 3x expected monthly spend as the default. Calls fail when the cap is hit, which prevents catastrophic overruns from runaway loops or compromised keys. Production keys should always have caps; experimental keys should have lower caps still.

2. Configure budget alerts

Wire alerts at 50%, 80%, and 100% of expected monthly spend to a channel the team actually reads (Slack, email, PagerDuty depending on severity). The 50% alert is informational; the 80% alert is a warning; the 100% alert is a crisis. Alerts are silent most of the time, which is the point.

3. Instrument per-feature spend

Tag every API call with the feature it serves (header metadata, separate API keys per feature, or a custom tracking layer). When the bill spikes, you want to know which feature caused it without having to guess. Per-feature visibility also lets you compute cost-per-user-per-feature, which is the input to the cost-per-feature framework below.

4. Build a runaway loop detector

In your agent harness, add hard limits: max turns per session (typically 50 to 100), max cost per session (typically $5 to $50 depending on workload), max wall-clock duration per session (typically 30 minutes). When any limit is hit, terminate cleanly and log the cause. A minimal sketch follows this list. The cost of the safety net is small; the cost of an unbounded loop is sometimes thousands of dollars.

5. Review weekly during the first month

For every new feature, schedule a weekly cost review for the first month after launch. Open the dashboard, look at the per-feature spend, compare to expected, investigate any anomalies. After the first month, monthly reviews are usually sufficient. The discipline catches drift while it is still small.

6. Track cache hit rate explicitly

If you implemented prompt caching, the success of the implementation is visible in the cache hit rate. Most providers expose this in API response metadata or in the dashboard. A cache hit rate below 70% on a workload that should hit 90% means something is invalidating the cache. Catch the regression early; the cost difference between hitting and missing is roughly 10x on the affected tokens.
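The sketch promised in step 4: a guard the agent loop consults after every model call. The thresholds are the illustrative defaults from the step; the per-turn cost would come from your response's usage metadata:

# Hedged sketch: hard limits checked on every agent turn.
import time

class SessionGuard:
    def __init__(self, max_turns=80, max_cost_usd=20.0, max_seconds=1800):
        self.max_turns = max_turns
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.turns = 0
        self.cost_usd = 0.0
        self.started = time.monotonic()

    def check(self, turn_cost_usd):
        # Record one turn; raise if any limit is exceeded.
        self.turns += 1
        self.cost_usd += turn_cost_usd
        elapsed = time.monotonic() - self.started
        if self.turns > self.max_turns:
            raise RuntimeError(f"turn limit hit ({self.max_turns})")
        if self.cost_usd > self.max_cost_usd:
            raise RuntimeError(f"cost limit hit (${self.max_cost_usd})")
        if elapsed > self.max_seconds:
            raise RuntimeError(f"duration limit hit ({self.max_seconds}s)")

When a guard trips, the harness terminates cleanly and logs which limit fired, which is exactly the information the postmortem needs.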

One pattern that pays back the time investment: build a simple internal dashboard on top of the provider data. The provider dashboards are good for raw numbers but bad for the questions you actually want to ask, like "what does feature X cost per user per day?" or "is the new model selection rule producing the savings we expected?". A small Streamlit or Grafana dashboard wired to the provider's usage API can answer those questions in a way the provider's UI cannot. The investment is a few days of engineering time. The return is making cost discipline a feature of the team's workflow rather than a memo someone sends quarterly.

The instrumentation philosophy is: the more visible the spend, the more controllable the spend. Teams that work in the dark guess at where the cost is coming from. Teams that work in the light fix the actual problem on day one instead of debugging it on day thirty.

Architectural Choices That Cut Costs By 10x Without Quality Loss

The biggest savings in production AI workloads come from architecture, not from haggling over per-token pricing. Five architectural choices, applied together, routinely cut costs by 10x compared to a naive implementation, with no detectable quality loss. Each one is independently worth doing; the compounding is what makes them powerful when applied as a set.

Prompt caching is the 80% lever. Anything stable in the prompt belongs in a cached portion. The system message, the tool descriptions, the reference docs, the agent's instructions, the schema for structured output. All of it. The discipline is to look at every prompt and ask which parts change between calls and which parts do not. The parts that do not change get cached. The parts that change go after the cache breakpoint. This single discipline often cuts the input bill by 60% to 80% on workloads that have any stable prefix at all, which is most workloads.

Cheaper model for the high-volume slice. Almost every AI workload has a tiered structure, where a small percentage of calls require the highest-quality model and the rest can be served by a cheaper one. The naive implementation uses the highest-quality model for everything. The disciplined implementation routes by complexity: a fast classification model decides which tier to use, the call goes to the right tier, and the high-tier model handles only the calls that justify the cost. Building the routing layer is one afternoon of engineering. The savings are typically 3x to 5x on the workload as a whole, because the high-tier traffic was usually overkill for most of what it was handling.
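The routing layer is small enough to sketch in full. The model IDs are illustrative and should be checked against current provider docs; the length heuristic is a stand-in for what would, in production, be a Haiku-tier classification call or a classifier tuned on your own traffic:

# Hedged sketch of a complexity-routing layer.
TIER_MODELS = {
    "simple": "claude-haiku-4-5",    # illustrative model names
    "standard": "claude-sonnet-4-5",
    "hard": "claude-opus-4-5",
}

def classify_complexity(prompt):
    # Stand-in heuristic; replace with a cheap classifier call.
    if len(prompt) < 500:
        return "simple"
    return "standard" if len(prompt) < 5_000 else "hard"

def route_call(client, prompt):
    model = TIER_MODELS.get(classify_complexity(prompt), "claude-sonnet-4-5")
    return client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )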

Truncate context aggressively. Most prompts contain more than the model needs. The first 500 lines of a file when the model only needs the function signature. The full conversation history when only the last three turns matter. The complete document when only one paragraph is relevant. The discipline is to ask, for every chunk of context, whether the model would produce a worse answer without it. If not, cut it. The savings on a heavy-context workload can be substantial, and the quality often improves because the model's attention is more focused.

Cache structured outputs. If the same input reliably produces the same output (a deterministic classification, a stable extraction from a stable document, a memoizable transformation), do not call the model twice. Cache the output. The cache is yours, not the provider's; build it as a key-value store keyed on the input hash, persist it as long as the underlying input is valid, look up the cached output before falling back to the model. The cache hit rate on the right workloads is high enough that the model call rate drops by 50% or more, which is a 50% cost reduction on the affected calls.
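A minimal sketch of the output cache; the in-memory dict stands in for whatever persistent key-value store you run (Redis, SQLite, a table keyed by hash):

# Hedged sketch: memoize deterministic model outputs by input hash.
import hashlib
import json

_output_cache = {}

def cached_completion(client, model, prompt):
    key = hashlib.sha256(
        json.dumps({"model": model, "prompt": prompt}, sort_keys=True).encode()
    ).hexdigest()
    if key in _output_cache:
        return _output_cache[key]  # cache hit: no model call, no tokens billed
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    _output_cache[key] = response.content[0].text
    return _output_cache[key]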

Batch eligible requests. Workloads that do not need real-time responses (overnight evaluation, bulk re-classification, periodic summarization, scheduled scoring) should use the batch APIs where available. OpenAI's Batch API discounts by 50% in exchange for a 24-hour SLA. Anthropic offers batch processing with similar discounts. Gemini has an analogous feature. The price reduction is large, the engineering cost of routing async work to batch is small, and the workload shape determines whether it applies. For any feature that produces output the user does not need within a few minutes, batch is the default.
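The batch shape, sketched against Anthropic's Message Batches API. The model choice and the documents list are illustrative, and the request format should be verified against current provider docs before relying on it:

# Hedged sketch: route async work through the batch endpoint.
import anthropic

client = anthropic.Anthropic()
documents = ["..."]  # your own corpus of texts to process overnight

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",  # correlation ID for collecting results
            "params": {
                "model": "claude-haiku-4-5",  # illustrative tier choice
                "max_tokens": 512,
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]
)
# Poll the batch status and collect results once processing completes;
# the discount exists because the provider schedules the work.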

The 5-bullet checklist that cuts costs 10x

Cache the stable prefix of every prompt.
Route to the cheapest model that produces acceptable output.
Truncate context to what the model genuinely needs.
Cache deterministic outputs in your own key-value store.
Batch any work that does not need real-time response.

Apply all five together. The compounded savings on a typical AI workload are 8x to 12x with no measurable quality loss, and most teams do not implement any of them on the first version of a feature.

The reason these patterns work in combination: they each address a different cost dimension. Caching addresses the fixed-prefix overhead. Model routing addresses the per-call cost. Truncation addresses the volume per call. Output caching addresses the call rate. Batching addresses the per-call price. The dimensions are independent, which means the savings multiply rather than add. A workload that achieves 50% reduction on each of two dimensions is at 25% of original cost, not 50%. Three dimensions at 50% each is 12.5%. The non-linearity is what produces the 10x outcomes.

The implementation cost is real but bounded. Adding all five patterns to an existing feature is typically two to three weeks of engineering for a senior team. The savings start the day the code ships and continue for the life of the feature. For any workload that bills more than a few thousand dollars a month, the payback is under two months. For any workload that bills more than $50K a month, the payback is days. The math is hard to argue with once you sit down and do it.

The Cost-Per-Feature Framework

The discipline that ties cost management to product reality is computing cost-per-feature explicitly. Most teams have a vague sense of what their AI features cost; few teams can tell you, for any specific feature, what it costs per user per month. The vague sense is what gets you in trouble. The explicit number is what gets you out of it.

The formula is mechanical. For each AI feature, estimate calls per user per day. Estimate input tokens per call (after caching, so use the effective rate, not the gross rate). Estimate output tokens per call. Multiply by the per-token costs at the relevant tier. Multiply by user count. The result is the monthly cost for the feature at current scale, broken down by component.

# Cost-per-feature calculation, generic shape

# Per-call inputs (after prompt caching)
cached_input_tokens = 5000          # cached portion (90% discount)
fresh_input_tokens = 800            # fresh portion (full rate)
output_tokens = 600

# Provider rates (Claude Sonnet, illustrative)
RATE_INPUT = 3.00 / 1_000_000       # $3 per 1M input tokens
RATE_INPUT_CACHED = 0.30 / 1_000_000  # $0.30 per 1M cached input
RATE_OUTPUT = 15.00 / 1_000_000     # $15 per 1M output tokens

# Cost per call
cost_per_call = (
    cached_input_tokens * RATE_INPUT_CACHED
    + fresh_input_tokens * RATE_INPUT
    + output_tokens * RATE_OUTPUT
)
# = (5000 * 0.0000003) + (800 * 0.000003) + (600 * 0.000015)
# = 0.0015 + 0.0024 + 0.009
# = $0.0129 per call

# Volume assumptions
calls_per_user_per_day = 8
days_per_month = 30
users = 5000

# Monthly cost for the feature
monthly_cost = (
    cost_per_call
    * calls_per_user_per_day
    * days_per_month
    * users
)
# = 0.0129 * 8 * 30 * 5000
# = $15,480 per month

# Per-user cost
per_user_per_month = (
    cost_per_call * calls_per_user_per_day * days_per_month
)
# = 0.0129 * 8 * 30
# = $3.10 per user per month

The numbers above are illustrative. The shape of the calculation generalizes. Plug in your own numbers, get your own answer. The discipline is to do this for every feature, not just to do it once. The reason most teams do not is that the inputs feel uncertain (how many calls per user? how many tokens per call?) and the impulse is to wait until the data is more solid. The waiting is the trap. A rough calculation today catches the order-of-magnitude problems before they ship; a precise calculation in three months catches the order-of-magnitude problems after they have already cost six figures.

The threshold check follows the calculation. The question is: at our current pricing, is the cost per user per month sustainable? If users pay $10 a month and the AI feature costs $3.10 per user per month, the feature is consuming 31% of revenue. That is borderline. If users pay $20 a month, the same cost is 15.5%, which is healthy. If users pay $5 a month, the feature loses money on every user. The same engineering, with the same cost profile, is sustainable in one pricing context and ruinous in another. Knowing which one you are in is the difference between a viable business and a slow-motion bankruptcy.

The honest answer when the threshold check fails. Three options, in order of preference.

First, change the model. Most cost-overrun features are running on a higher tier than the workload requires. Drop a tier and re-test the quality. If the quality holds, ship the cheaper tier. The 5x cost reduction across tiers is enough to fix most threshold problems.

Second, change the design. Maybe the feature is calling the model when it should be caching. Maybe it is sending more context than it needs. Maybe it is making three calls when one would do. Architectural rework is more invasive than model swapping but produces larger savings; this is where the 10x architecture bullets pay off. A redesigned feature is often 5x cheaper than the original at the same quality.

Third, change the pricing. If the feature is genuinely valuable but expensive, charge for it. Move it behind a higher tier of the product, sell it as an add-on, or raise the base price. The honest version of this conversation is uncomfortable, because it forces the team to admit that the AI feature has a real cost that someone has to pay. The dishonest version is to ship the feature at a price that does not cover the cost and hope that something will rescue the unit economics later. The dishonest version is the most common failure mode in early-stage AI products.

$3.10: Example per-user-per-month cost for a Sonnet-tier feature with 8 calls per day, 5,800 input tokens, 600 output tokens, after caching
~30%: Healthy ceiling for AI feature cost as a share of revenue per user; above this, the feature is structurally fragile
3: Levers when the cost-per-user threshold check fails (change the model, change the design, change the pricing)

The framework only works if it is run regularly. A cost-per-feature calculation done at launch and never revisited is a snapshot that ages immediately. Calls per user change as the feature gains adoption. Token volumes change as the model is updated. Pricing changes as providers revise their rates. The discipline is to recompute the numbers monthly, not to compute them once. The reward is that the cost surprise that would have arrived in month six arrives in month two instead, when fixing it is still cheap.

One more piece of discipline. Build the cost-per-feature calculation into the product spec for every new AI feature, before the feature is built. Estimate the cost. Run the threshold check. If the feature does not pass the check, redesign before writing code, not after. The cost of redesigning a spec is hours. The cost of redesigning a shipped feature is weeks. The bias toward "we will optimize later" is the single most common reason AI features ship with bad unit economics, and the single easiest way to prevent it is to make the threshold check a gate, not an afterthought.

Common Failure Modes

The list of recurring failure modes in AI cost management is short and stable. Most teams hit most of them on first contact with the discipline; the ones that move past them are the ones that recognize the patterns early.

Defaulting to the highest-tier model for everything. The most expensive habit in the field. The high tier is appropriate for some calls and overkill for most. The cost of running everything at the high tier is roughly 5x the cost of running the right work at the right tier. Fix: build a simple routing layer that selects the model by call complexity. The routing layer is small. The savings are large.

Not caching the system prompt. The single most common implementation gap. A 5K-token system prompt that is paid at full input rate on every call is roughly 10x more expensive than the same system prompt cached. The fix is mechanical: add a cache breakpoint after the stable portion. The discount applies on every subsequent call within the cache TTL. Most teams that have not done this can cut their bill by 50% in an afternoon.

Letting agent loops run unbounded. The pattern is rare per session, but when it triggers, the damage is catastrophic. A runaway loop on a high-tier model can spend $1,000 in a few hours. The fix is hard limits in the agent harness: max turns per session, max cost per session, max duration per session. The limits do not affect normal sessions; they save you from the abnormal ones.

Treating context as free. Every token in the prompt is paid for on every call. Treating the context window as a buffer to fill rather than a budget to spend is the most common form of wasted spend in agent workloads. The fix is the discipline of asking, for every chunk of context, whether the model would produce worse output without it. The answer is usually no.

Not instrumenting per-feature cost. Without per-feature breakdowns, the bill is a single number that is hard to reason about and impossible to optimize. With per-feature breakdowns, every spike is traceable to a specific feature, and the optimization work targets the actual problem. The fix is metadata tagging on every API call, plus a small dashboard that aggregates by feature.

Skipping the cost-per-feature calculation at the spec stage. The fix is to make the calculation a gate on every new AI feature. Estimate the cost. Run the threshold check. If the feature fails, redesign before writing code. The bias is to skip the calculation because the inputs feel uncertain; the cost of skipping is paid later, in larger amounts, in features that ship with broken economics.

Optimizing per-token pricing instead of per-token volume. Provider haggling is a small lever; volume reduction is a large lever. Time spent comparing Claude versus GPT versus Gemini at the same tier is usually time better spent on caching, model routing, and context truncation. The discount across providers at the same tier is typically a few percent to a few tens of percent. The discount from architectural improvements is typically multiples. Spend the optimization budget on the architecture, then shop for the cheapest provider at each tier as a secondary pass.

Not reviewing spend during the first month after launch. The first month is when the cost profile is most likely to surprise you. After the first month, drift is usually slow. The discipline is to schedule weekly spend reviews for the first month of any new feature, then drop to monthly once the profile is stable. The reviews are short; their value is in catching anomalies while they are still small.

When Cost Optimization Is Not the Right Move

An honest section before closing. There are workloads where aggressive cost optimization is the wrong call, and recognizing them is part of the discipline. Three patterns matter.

Early-stage features where the question is whether they work, not what they cost. If you are still figuring out whether a feature produces the value you hoped for, optimizing the cost is premature. Run on the highest-quality model. Use the most generous context. Let the cost be high while you learn whether the feature is worth shipping at all. Once the answer is yes, then the optimization work pays back. The order matters: validate first, optimize second. Optimizing a feature that does not work is wasted effort, and optimizing too early hides the signal you need to decide whether the feature works at all.

High-stakes calls where being wrong is expensive. A code review that catches a critical bug. A medical document analysis that affects patient care. A financial reasoning step that triggers a real transaction. For these calls, the cost difference between Sonnet and Opus is a rounding error compared to the cost of an incorrect output. Use the high tier. Use the more conservative context budget. Do not save $0.10 per call to risk a $10,000 mistake.

Workloads where the alternative is human time. If the question is whether to spend $50 of API credit or two hours of an engineer's time, the API credit is the cheaper option. Cost optimization for these workloads should be measured against the human-time alternative, not against an idealized minimum. A feature that costs $20 per use but saves a senior engineer an hour is a bargain even if a 5x cost reduction were available. Pursue the cost reduction when it is easy; do not pursue it when it would cost more engineering time than it saves in API spend.

The meta-principle: cost optimization is a means to an end. The end is sustainable shipping of features users will pay for. When optimization serves that end, do it. When it does not, skip it. The discipline is not to optimize everything always; it is to know which optimizations matter for which workloads and to prioritize accordingly. The teams that get this right ship cost-disciplined features without spending their engineering time on cost spreadsheets. The teams that overdo it lose more in delayed shipping than they save in tokens.

Closing

Cost management for AI development is not optimization for its own sake. It is the discipline that lets you ship features users will actually pay for. The team that controls token spend ships sustainably, releases new features with confidence in the unit economics, and does not have to re-architect every feature six months in when the bill arrives. The team that ignores it discovers, three months in, that the AI feature their product is built around costs more per user than it earns, and that the architectural choices that produced the cost profile are calcified into the codebase.

The shape of the practice is straightforward. Name the components of your spend so you can see what you are paying for. Pick the smallest model that produces acceptable output for each slice of the work. Cache the stable parts of every prompt; the discount is roughly 90% on the cached portion and is the largest single lever in the field. Treat context as a budget rather than a buffer. Use sub-agents for exploration-heavy tasks so the parent's window stays clean. Compare providers honestly and pick by workload shape, not by allegiance. Instrument per-feature spend, set hard caps, alert at thresholds, build a runaway-loop detector. Apply the architectural patterns together because they compound. Run the cost-per-feature calculation at the spec stage and again monthly after launch.

None of this is exotic. All of it is consistent. The teams that practice the discipline ship features the teams that ignore it cannot afford to ship. The frontier of model capability is moving fast; the frontier of cost discipline moves slowly, because the laws of unit economics are the same as they were before the AI tools existed. The team that combines a strong technical posture on the models with a strong financial posture on the spend is the team that wins on both axes, and the second axis is the one that determines whether the features stay in production once the novelty has worn off.

Treat the per-token cost as the load-bearing variable it is. Not as a finance concern to be addressed quarterly. Not as somebody else's problem. As the silent architectural layer of every AI feature you ship, allocated by you, with the discipline you bring to any other resource that has a price and a limit. The agents are good. The models are getting better every six months. The thing that does not get better automatically is the cost of using them well, and the work you put into that part is the work that compounds across every user, every session, and every line item on every invoice for the life of the product.