Token economics, model tier selection, prompt caching at scale, and the monitoring you need before AI features eat your budget.
AI API bills go from $20/month to $20,000/month faster than any other line item if a feature gets popular and the implementation is wasteful. Optimization is best done before scale, not after.
This tutorial covers the cost levers that move the needle, in rough order of impact.
Claude pricing (May 2026):
| Model | Input ($/M tokens) | Output ($/M tokens) | Cached input ($/M tokens) |
|---|---|---|---|
| Haiku 4.5 | $0.80 | $4 | $0.08 |
| Sonnet 4.6 | $3 | $15 | $0.30 |
| Opus 4.7 | $15 | $75 | $1.50 |
A 1,000-token request with a 500-token response on Sonnet costs 1,000 × $3/M + 500 × $15/M ≈ $0.0105.
That seems trivial. Multiply by 100,000 calls per month and it's roughly $1,000. Multiply by 1M calls and it's roughly $10,000. The unit economics need to make sense at the volume you'll actually have.
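To make the unit economics easy to check, a small estimator helps when comparing tiers. This is a sketch based on the table above; the price constants and helper name are illustrative, not from any SDK:

# Rough per-call cost estimator; prices mirror the table above ($ per million tokens)
PRICES = {
    "claude-haiku-4-5": (0.80, 4.00, 0.08),    # (input, output, cached input)
    "claude-sonnet-4-6": (3.00, 15.00, 0.30),
    "claude-opus-4-7": (15.00, 75.00, 1.50),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    inp, out, cached = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * inp + cached_tokens * cached + output_tokens * out) / 1_000_000

# 1000 input + 500 output tokens on Sonnet ≈ $0.0105
print(estimate_cost("claude-sonnet-4-6", 1000, 500))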
Don't default to Opus. Most production tasks fit Sonnet or Haiku.
Practical guide:

- Haiku: classification, extraction, routing, and other tasks with short, well-defined outputs.
- Sonnet: most generation, summarization, RAG answers, and user-facing features; a sensible default.
- Opus: the hardest reasoning tasks, and only once Sonnet has measurably failed on them.
A common pattern is the router — a cheap classifier picks the right tier:
def smart_ask(question: str) -> str:
    # Cheap classifier in Haiku
    complexity = classify_with_haiku(question)
    # Returns "simple", "moderate", or "complex"
    model_map = {
        "simple": "claude-haiku-4-5",
        "moderate": "claude-sonnet-4-6",
        "complex": "claude-opus-4-7",
    }
    return ask_with_model(question, model_map[complexity])
The classifier costs roughly $0.0001 per call. The savings from routing 80% of traffic to Haiku/Sonnet instead of always using Opus easily pay for it.
If you have a long, stable prefix on most requests (system prompt, examples, document context), prompt caching is the biggest single cost win available.
response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,  # 5000 tokens
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[...],
)
The first call writes the prefix to the cache (cache writes are billed at a small premium over the normal input rate). Any call within ~5 minutes that shares the same prefix reads it from the cache at 10% of the input price. So a 5,000-token system prompt drops from $0.015 per call to $0.0015 per call after the first.
For a high-volume feature with 1M calls/month, that's roughly 1,000,000 × ($0.015 - $0.0015) ≈ $13,500 saved.
Over ten thousand dollars per month, with one parameter change.
What to cache: long system prompts, few-shot examples, and shared document context; anything that is identical across many requests and reused within the cache window. Put the stable content first and the per-request content last, since caching applies to the prefix.
Some requests are repeats. Same question, same answer (often). Cache the full response in Redis:
import hashlib
from django.core.cache import cache

def cached_ask(prompt: str, ttl: int = 3600) -> str:
    key = "ai:ans:" + hashlib.sha256(prompt.encode()).hexdigest()
    answer = cache.get(key)
    if answer is not None:
        return answer
    answer = ask_claude(prompt)
    cache.set(key, answer, ttl)
    return answer
For deterministic-ish queries (FAQs, "what's the price of X", "summarize this fixed document"), cache hit rates of 30–80% are normal. Each hit = $0 instead of $0.01.
Be careful with this for personalized/contextual queries — caching them across users leaks data.
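If you do want response caching on personalized queries, scope the cache key to the user so answers can never cross accounts. A minimal variant of the helper above, with the same assumed ask_claude helper:

import hashlib
from django.core.cache import cache

def cached_ask_for_user(user_id: int, prompt: str, ttl: int = 3600) -> str:
    # Include the user in the key so one user's answer is never served to another
    key = f"ai:ans:{user_id}:" + hashlib.sha256(prompt.encode()).hexdigest()
    answer = cache.get(key)
    if answer is not None:
        return answer
    answer = ask_claude(prompt)
    cache.set(key, answer, ttl)
    return answer

The hit rate drops (each user warms their own cache), but the privacy failure mode disappears.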
If you send 5000 tokens but the model only needs 500, you're paying 10x more than necessary.
For RAG, retrieve the right chunks and stop. Don't dump 50 chunks into the prompt "just in case":
# Bad
context = "\n\n".join(c.content for c in DocumentChunk.objects.all()[:50])
# Better
chunks = retrieve(question, k=5) # Top 5 most relevant
context = "\n\n".join(c.content for c in chunks)
For chat history, summarize old turns instead of replaying them all:
# Bad: replay every turn
messages = ChatMessage.objects.filter(session=session).order_by("created_at")[:50]

# Better: keep recent turns verbatim, summarize older ones
# (materialize the queryset first; negative slicing isn't supported on querysets)
messages = list(messages)
recent = messages[-10:]
older = messages[:-10]
if older:
    summary = ask_claude(f"Summarize this conversation: {older}")
    # Claude's messages array only accepts "user"/"assistant" roles, so pass the
    # summary as a user turn (or fold it into the system prompt)
    history = [{"role": "user", "content": f"Earlier summary: {summary}"}, *recent]
A copy-pasted max_tokens (often 4096) barely bounds your bill: the model can produce up to that many output tokens, all billed at the expensive output rate.
If you need a 100-word answer (roughly 130–150 tokens), set max_tokens=200. That forces brevity and caps the cost.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,  # Not 4096
    messages=[...],
)
This is one of the easiest wins and most teams overlook it.
If you're processing 1000 items, doing them serially is slow but simple. Doing them in parallel with asyncio.gather is fast and uses the same total tokens.
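A rough sketch of the parallel version using the SDK's async client; the concurrency cap, model name, and prompt list are placeholders to adapt:

import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
semaphore = asyncio.Semaphore(10)  # cap concurrent requests to stay under rate limits

async def ask_one(prompt: str) -> str:
    async with semaphore:
        response = await client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

async def ask_many(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(ask_one(p) for p in prompts))

# results = asyncio.run(ask_many(prompts))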
But for many workloads, you can do better with batch inference — Anthropic's batch API runs jobs at 50% off in exchange for a slower turnaround:
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"item-{i}", "params": {...}}
        for i in range(1000)
    ]
)
# Result available within 24 hours, half the cost
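Fetching the output is a separate step: poll the batch until processing ends, then iterate its results. A hedged sketch continuing from the batch created above; the status handling follows the current Python SDK but is simplified here:

import time

# Poll until the batch finishes processing
while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    time.sleep(60)

# Stream per-request results, matched back by custom_id
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        text = entry.result.message.content[0].text
        print(entry.custom_id, text[:80])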
For nightly indexing, weekly classification runs, etc. — half the bill, no engineering complexity.
You can't optimize what you don't measure. Add per-call logging:
import time
import logging

logger = logging.getLogger("ai.cost")

def tracked_ask(prompt: str, feature: str) -> str:
    t0 = time.monotonic()
    response = client.messages.create(...)
    elapsed = time.monotonic() - t0
    logger.info(
        "ai_call",
        extra={
            "feature": feature,
            "model": response.model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0),
            "elapsed_ms": int(elapsed * 1000),
        },
    )
    return response.content[0].text
Pipe it to Sentry, Datadog, or your existing logging. Build a daily Grafana dashboard on top of it: spend per feature, spend per model, cache hit rate, and average tokens per call.
When a feature spikes (someone wrote a bot, a viral mention drove traffic), you see it within hours, not at the end of the month.
For consumer features, set per-user rate limits in Redis:
from datetime import datetime
from django.core.cache import cache

DAILY_LIMIT = 50  # pick a limit that fits your margins

def can_ask(user_id: int) -> bool:
    key = f"ai:limit:{user_id}:{datetime.utcnow():%Y-%m-%d}"
    count = cache.get(key, 0)
    if count >= DAILY_LIMIT:
        return False
    # get/set is fine for a soft limit; use cache.incr for strict counting
    cache.set(key, count + 1, 86400)
    return True
For per-feature budgets, use circuit breakers — if today's spend exceeds X, switch to a cheaper model or disable the feature entirely until tomorrow.
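A minimal sketch of such a circuit breaker, assuming Django's cache as the shared counter; the budget, key names, and thresholds are illustrative:

from datetime import datetime
from django.core.cache import cache

DAILY_BUDGET_USD = 200.0  # per-feature budget; pick numbers that fit your margins

def record_spend(feature: str, cost_usd: float) -> None:
    key = f"ai:spend:{feature}:{datetime.utcnow():%Y-%m-%d}"
    # incr only works on existing integer keys, so track cents and seed the key first
    cache.add(key, 0, 86400)
    cache.incr(key, int(cost_usd * 100))

def pick_model(feature: str) -> str | None:
    key = f"ai:spend:{feature}:{datetime.utcnow():%Y-%m-%d}"
    spent = cache.get(key, 0) / 100
    if spent >= DAILY_BUDGET_USD:
        return None  # over budget: disable the feature until tomorrow
    if spent >= DAILY_BUDGET_USD * 0.8:
        return "claude-haiku-4-5"  # approaching budget: degrade to the cheap tier
    return "claude-sonnet-4-6"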
A reasonable mature setup:

- Prompt caching on every stable system prompt and shared context
- A router that sends most traffic to Haiku or Sonnet, with Opus reserved for the hardest cases
- Response caching in Redis for repeatable queries
- The batch API for nightly and offline jobs
- Per-call cost logging feeding a daily per-feature dashboard
- Per-user rate limits and per-feature budget circuit breakers
- max_tokens set tightly per feature

This setup typically cuts AI bills by 60–90% vs naive implementations, with no quality loss.
That said: if you're shipping a beta with 50 users, don't build all of this on day 1. Get the feature working, then instrument it, then optimize the top costs. Premature optimization is real even with AI.
The right order is usually:

1. Instrument per-call costs so you can see where the money goes.
2. Set max_tokens tightly and pick the cheapest model tier that passes your quality bar.
3. Add prompt caching for stable prefixes.
4. Add response caching, trimmed prompts, and batch processing where they fit.
5. Add rate limits and budget circuit breakers once real users can drive spend.
Each layer pays for the engineering time it takes to add it, with margin.