Token economics, model tier selection, prompt caching at scale, and the monitoring you need before AI features eat your budget.
AI API bills go from $20/month to $20,000/month faster than any other line item if a feature gets popular and the implementation is wasteful. Optimization is best done before scale, not after.
This tutorial covers the cost levers that move the needle, in rough order of impact.
Claude pricing (May 2026):
| Model | Input ($/M tokens) | Output ($/M tokens) | Cached input ($/M tokens) |
|---|---|---|---|
| Haiku 4.5 | $0.80 | $4 | $0.08 |
| Sonnet 4.6 | $3 | $15 | $0.30 |
| Opus 4.7 | $15 | $75 | $1.50 |
A 1,000-token request with a 500-token response on Sonnet costs 1,000 × $3/M + 500 × $15/M ≈ $0.0105.
That seems trivial. Multiply by 100,000 calls per month and it's roughly $1,000. Multiply by 1M calls and it's roughly $10,000. The unit economics need to make sense at the volume you'll actually have.
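To make the unit economics easy to check, a small estimator helps when comparing tiers. This is a sketch based on the table above; the price constants and helper name are illustrative, not from any SDK:

# Rough per-call cost estimator; prices mirror the table above ($ per million tokens)
PRICES = {
    "claude-haiku-4-5": (0.80, 4.00, 0.08),    # (input, output, cached input)
    "claude-sonnet-4-6": (3.00, 15.00, 0.30),
    "claude-opus-4-7": (15.00, 75.00, 1.50),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    inp, out, cached = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * inp + cached_tokens * cached + output_tokens * out) / 1_000_000

# 1000 input + 500 output tokens on Sonnet ≈ $0.0105
print(estimate_cost("claude-sonnet-4-6", 1000, 500))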
Don't default to Opus. Most production tasks fit Sonnet or Haiku.
Practical guide:

- Haiku: classification, extraction, routing, and other tasks with short, well-defined outputs.
- Sonnet: most generation, summarization, RAG answers, and user-facing features; a sensible default.
- Opus: the hardest reasoning tasks, and only once Sonnet has measurably failed on them.
A common pattern is the router — a cheap classifier picks the right tier:
def smart_ask(question: str) -> str:
    # Cheap classifier in Haiku
    complexity = classify_with_haiku(question)
    # Returns "simple", "moderate", or "complex"
    model_map = {
        "simple": "claude-haiku-4-5",
        "moderate": "claude-sonnet-4-6",
        "complex": "claude-opus-4-7",
    }
    return ask_with_model(question, model_map[complexity])
The classifier costs roughly $0.0001 per call. The savings from routing 80% of traffic to Haiku/Sonnet instead of always using Opus easily pay for it.
If you have a long, stable prefix on most requests (system prompt, examples, document context), prompt caching is the biggest single cost win available.
response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,  # 5000 tokens
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[...],
)
The first call writes the prefix to the cache (cache writes are billed at a small premium over the normal input rate). Any call within ~5 minutes that shares the same prefix reads it from the cache at 10% of the input price. So a 5,000-token system prompt drops from $0.015 per call to $0.0015 per call after the first.
For a high-volume feature with 1M calls/month, that's roughly 1,000,000 × ($0.015 - $0.0015) ≈ $13,500 saved.
Over ten thousand dollars per month, with one parameter change.
What to cache: long system prompts, few-shot examples, and shared document context; anything that is identical across many requests and reused within the cache window. Put the stable content first and the per-request content last, since caching applies to the prefix.
Some requests are repeats. Same question, same answer (often). Cache the full response in Redis:
import hashlib
from django.core.cache import cache

def cached_ask(prompt: str, ttl: int = 3600) -> str:
    key = "ai:ans:" + hashlib.sha256(prompt.encode()).hexdigest()
    answer = cache.get(key)
    if answer is not None:
        return answer
    answer = ask_claude(prompt)
    cache.set(key, answer, ttl)
    return answer
For deterministic-ish queries (FAQs, "what's the price of X", "summarize this fixed document"), cache hit rates of 30–80% are normal. Each hit = $0 instead of $0.01.
Be careful with this for personalized/contextual queries — caching them across users leaks data.
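If you do want response caching on personalized queries, scope the cache key to the user so answers can never cross accounts. A minimal variant of the helper above, with the same assumed ask_claude helper:

import hashlib
from django.core.cache import cache

def cached_ask_for_user(user_id: int, prompt: str, ttl: int = 3600) -> str:
    # Include the user in the key so one user's answer is never served to another
    key = f"ai:ans:{user_id}:" + hashlib.sha256(prompt.encode()).hexdigest()
    answer = cache.get(key)
    if answer is not None:
        return answer
    answer = ask_claude(prompt)
    cache.set(key, answer, ttl)
    return answer

The hit rate drops (each user warms their own cache), but the privacy failure mode disappears.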
If you send 5000 tokens but the model only needs 500, you're paying 10x more than necessary.
For RAG, retrieve the right chunks and stop. Don't dump 50 chunks into the prompt "just in case":
# Bad
context = "\n\n".join(c.content for c in DocumentChunk.objects.all()[:50])
# Better
chunks = retrieve(question, k=5) # Top 5 most relevant
context = "\n\n".join(c.content for c in chunks)
For chat history, summarize old turns instead of replaying them all:
# Bad: replay every turn
messages = ChatMessage.objects.filter(session=session).order_by("created_at")[:50]

# Better: keep recent turns verbatim, summarize older ones
# (materialize the queryset first; negative slicing isn't supported on querysets)
messages = list(messages)
recent = messages[-10:]
older = messages[:-10]
if older:
    summary = ask_claude(f"Summarize this conversation: {older}")
    # Claude's messages array only accepts "user"/"assistant" roles, so pass the
    # summary as a user turn (or fold it into the system prompt)
    history = [{"role": "user", "content": f"Earlier summary: {summary}"}, *recent]
A copy-pasted max_tokens (often 4096) barely bounds your bill: the model can produce up to that many output tokens, all billed at the expensive output rate.
If you need a 100-word answer (roughly 130–150 tokens), set max_tokens=200. That forces brevity and caps the cost.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,  # Not 4096
    messages=[...],
)
This is one of the easiest wins and most teams overlook it.
If you're processing 1000 items, doing them serially is slow but simple. Doing them in parallel with asyncio.gather is fast and uses the same total tokens.
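A rough sketch of the parallel version using the SDK's async client; the concurrency cap, model name, and prompt list are placeholders to adapt:

import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
semaphore = asyncio.Semaphore(10)  # cap concurrent requests to stay under rate limits

async def ask_one(prompt: str) -> str:
    async with semaphore:
        response = await client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

async def ask_many(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(ask_one(p) for p in prompts))

# results = asyncio.run(ask_many(prompts))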
But for many workloads, you can do better with batch inference — Anthropic's batch API runs jobs at 50% off in exchange for a slower turnaround:
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"item-{i}", "params": {...}}
        for i in range(1000)
    ]
)
# Result available within 24 hours, half the cost
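Fetching the output is a separate step: poll the batch until processing ends, then iterate its results. A hedged sketch continuing from the batch created above; the status handling follows the current Python SDK but is simplified here:

import time

# Poll until the batch finishes processing
while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    time.sleep(60)

# Stream per-request results, matched back by custom_id
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        text = entry.result.message.content[0].text
        print(entry.custom_id, text[:80])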
For nightly indexing, weekly classification runs, etc. — half the bill, no engineering complexity.
You can't optimize what you don't measure. Add per-call logging:
import time
import logging

logger = logging.getLogger("ai.cost")

def tracked_ask(prompt: str, feature: str) -> str:
    t0 = time.monotonic()
    response = client.messages.create(...)
    elapsed = time.monotonic() - t0
    logger.info(
        "ai_call",
        extra={
            "feature": feature,
            "model": response.model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0),
            "elapsed_ms": int(elapsed * 1000),
        },
    )
    return response.content[0].text
Pipe it to Sentry, Datadog, or your existing logging. Build a daily Grafana dashboard on top of it: spend per feature, spend per model, cache hit rate, and average tokens per call.
When a feature spikes (someone wrote a bot, a viral mention drove traffic), you see it within hours, not at the end of the month.
For consumer features, set per-user rate limits in Redis:
from datetime import datetime
from django.core.cache import cache

DAILY_LIMIT = 50  # pick a limit that fits your margins

def can_ask(user_id: int) -> bool:
    key = f"ai:limit:{user_id}:{datetime.utcnow():%Y-%m-%d}"
    count = cache.get(key, 0)
    if count >= DAILY_LIMIT:
        return False
    # get/set is fine for a soft limit; use cache.incr for strict counting
    cache.set(key, count + 1, 86400)
    return True
For per-feature budgets, use circuit breakers — if today's spend exceeds X, switch to a cheaper model or disable the feature entirely until tomorrow.
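A minimal sketch of such a circuit breaker, assuming Django's cache as the shared counter; the budget, key names, and thresholds are illustrative:

from datetime import datetime
from django.core.cache import cache

DAILY_BUDGET_USD = 200.0  # per-feature budget; pick numbers that fit your margins

def record_spend(feature: str, cost_usd: float) -> None:
    key = f"ai:spend:{feature}:{datetime.utcnow():%Y-%m-%d}"
    # incr only works on existing integer keys, so track cents and seed the key first
    cache.add(key, 0, 86400)
    cache.incr(key, int(cost_usd * 100))

def pick_model(feature: str) -> str | None:
    key = f"ai:spend:{feature}:{datetime.utcnow():%Y-%m-%d}"
    spent = cache.get(key, 0) / 100
    if spent >= DAILY_BUDGET_USD:
        return None  # over budget: disable the feature until tomorrow
    if spent >= DAILY_BUDGET_USD * 0.8:
        return "claude-haiku-4-5"  # approaching budget: degrade to the cheap tier
    return "claude-sonnet-4-6"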
A reasonable mature setup:

- Prompt caching on every stable system prompt and shared context
- A router that sends most traffic to Haiku or Sonnet, with Opus reserved for the hardest cases
- Response caching in Redis for repeatable queries
- The batch API for nightly and offline jobs
- Per-call cost logging feeding a daily per-feature dashboard
- Per-user rate limits and per-feature budget circuit breakers
- max_tokens set tightly per feature

This setup typically cuts AI bills by 60–90% vs naive implementations, with no quality loss.
That said: if you're shipping a beta with 50 users, don't build all of this on day 1. Get the feature working, then instrument it, then optimize the top costs. Premature optimization is real even with AI.
The right order is usually:

1. Instrument per-call costs so you can see where the money goes.
2. Set max_tokens tightly and pick the cheapest model tier that passes your quality bar.
3. Add prompt caching for stable prefixes.
4. Add response caching, trimmed prompts, and batch processing where they fit.
5. Add rate limits and budget circuit breakers once real users can drive spend.
Each layer pays for the engineering time it takes to add it, with margin.