The architecture, evals, monitoring, and process disciplines that take an AI feature from "works in a demo" to "survives real users at scale."
The first version of an AI feature usually works in a demo. The version that actually serves real users for months without on-call pages — that's a different beast. The gap between them is the work of this tutorial.
This is the playbook for crossing it.
Goal: prove the LLM can do the task at all.
What to build:
What NOT to build yet:
What to decide at the end of this phase:
If quality is acceptable, advance.
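Phase 1 can be a single throwaway script. A minimal sketch using the Anthropic Python SDK; the model name and sample question are illustrative:

```python
# poc.py — throwaway: can the model do the task at all?
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(question: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative; use the model you're evaluating
        max_tokens=500,
        messages=[{"role": "user", "content": question}],
    )
    return message.content[0].text

if __name__ == "__main__":
    print(ask("How do I install QuizCraft?"))
```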
Goal: plug the feature into the Django app and let internal users hit it.
What to build:
```python
# settings.py
from decouple import config  # python-decouple

AI_FEATURE_X_ENABLED = config("AI_FEATURE_X_ENABLED", default=False, cast=bool)
AI_FEATURE_X_STAFF_ONLY = config("AI_FEATURE_X_STAFF_ONLY", default=True, cast=bool)


# views.py
from django.conf import settings
from django.contrib.auth.decorators import login_required
from django.http import Http404

@login_required
def feature_x(request):
    # Two kill switches: turn the feature off entirely, or restrict it to staff
    if not settings.AI_FEATURE_X_ENABLED:
        raise Http404()
    if settings.AI_FEATURE_X_STAFF_ONLY and not request.user.is_staff:
        raise Http404()
    ...
```
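The gate is cheap to regression-test so it can't silently rot. A sketch using Django's test client; the URL is an assumption, substitute your route:

```python
# tests/test_feature_x.py
from django.contrib.auth import get_user_model
from django.test import TestCase, override_settings

@override_settings(AI_FEATURE_X_ENABLED=True, AI_FEATURE_X_STAFF_ONLY=True)
class FeatureXGateTests(TestCase):
    def test_non_staff_user_gets_404(self):
        user = get_user_model().objects.create_user("alice", password="pw")
        self.client.force_login(user)
        response = self.client.get("/feature-x/")  # assumption: wherever the view is mounted
        self.assertEqual(response.status_code, 404)
```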
What to decide at the end of this phase:
Goal: make it impossible to ship a prompt change without knowing whether quality regressed.
What to build:
```json
// evals/test_set.json
[
  {"input": "How do I install QuizCraft?", "expected_keyword": "venv"},
  {"input": "Refund policy?", "expected_keyword": "30 days"},
  ...
]
```
```python
# evals/run_eval.py
import json

def load_test_set(path="evals/test_set.json"):
    with open(path) as f:
        return json.load(f)

def evaluate():
    test_set = load_test_set()
    results = []
    for example in test_set:
        actual = ask(example["input"])  # ask() is your model-calling helper
        passed = example["expected_keyword"].lower() in actual.lower()
        results.append({"input": example["input"], "passed": passed, "actual": actual})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"Pass rate: {pass_rate:.1%}")
    return pass_rate
```
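To make the eval actually gate releases, fail CI when the pass rate drops below a floor. A minimal sketch; the 90% threshold is an assumption, set it from your measured baseline:

```python
# evals/ci_gate.py
import sys

from run_eval import evaluate

THRESHOLD = 0.90  # assumption: derive this from your baseline pass rate

if __name__ == "__main__":
    sys.exit(0 if evaluate() >= THRESHOLD else 1)
```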
What to decide at the end of this phase:
Goal: see what real users do that you didn't anticipate.
What to build:
What to expect:
What to decide at the end of this phase:
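One building block worth sketching for this phase, since it feeds the thumbs-down metric later: an explicit feedback signal on every AI response. A minimal sketch; the model and field names are hypothetical:

```python
# models.py
from django.conf import settings
from django.db import models

class AIFeedback(models.Model):
    # Hypothetical: one row per thumbs up/down on an AI response
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    prompt_version = models.CharField(max_length=64)
    response_id = models.CharField(max_length=64)  # correlate with your call logs
    thumbs_up = models.BooleanField()
    created_at = models.DateTimeField(auto_now_add=True)
```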
Goal: make it boring and reliable.
What to build:
```python
# Versioned prompts
PROMPT_VERSION = "summarize_v3"

PROMPTS = {
    "summarize_v1": "Summarize this:",
    "summarize_v2": "Provide a 2-sentence summary:",
    "summarize_v3": "Provide a concise 2-sentence summary preserving key facts:",
}

response = ask(PROMPTS[PROMPT_VERSION])
log_call(prompt_version=PROMPT_VERSION, ...)
```
When you change a prompt and quality drops, you can correlate the regression with the version change in your logs.
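log_call() here is whatever logging helper you already have. One plausible shape, so prompt_version is queryable in your log aggregator, is a single structured line per call (field names are illustrative):

```python
import json
import logging

logger = logging.getLogger("ai.calls")

def log_call(prompt_version, **fields):
    # One JSON line per model call; fields might include model,
    # latency_ms, token counts, user id, and an output sample.
    logger.info(json.dumps({"prompt_version": prompt_version, **fields}))
```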
By Phase 5, your AI code should look something like this:
```
┌─ View layer ───────────────────────────────────┐
│ - Authentication, rate limiting, feature flags │
│ - Calls service layer, returns response        │
└────────────────────────────────────────────────┘
                        ↓
┌─ Service layer ────────────────────────────────┐
│ - Routes to the right model (router)           │
│ - Constructs prompt from versioned templates   │
│ - Calls inference layer                        │
│ - Validates output (Pydantic)                  │
│ - Logs call with all metadata                  │
└────────────────────────────────────────────────┘
                        ↓
┌─ Inference layer ──────────────────────────────┐
│ - Wraps the Anthropic SDK                      │
│ - Handles retries, errors, timeouts            │
│ - Manages prompt caching                       │
└────────────────────────────────────────────────┘
```
Three clean layers, each testable. Don't put client.messages.create() in your view.
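A minimal sketch of that inference layer, assuming the Anthropic Python SDK; the model name, retry count, and timeout are placeholders to tune:

```python
# services/inference.py
import anthropic

# Configure retries and timeouts once, here; the SDK retries connection
# errors, 429s, and 5xx responses up to max_retries times.
client = anthropic.Anthropic(max_retries=3, timeout=30.0)

class InferenceError(Exception):
    """The model call failed after retries; callers decide how to degrade."""

def ask(prompt: str, model: str = "claude-sonnet-4-20250514") -> str:
    try:
        message = client.messages.create(
            model=model,  # placeholder; pin whatever your router selects
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
        )
    except anthropic.APIError as exc:
        raise InferenceError(str(exc)) from exc
    return message.content[0].text
```

With this in place, the view and service layers never touch the SDK directly, which is exactly what makes them testable with a stub.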
The metrics that matter:
| Metric | Alert when |
|---|---|
| p50 / p95 latency | p95 > 10 seconds |
| Error rate | > 2% |
| Cost per user per day | > 2x baseline |
| Eval pass rate (weekly) | drops > 10 points |
| User feedback (thumbs down rate) | > 20% |
| Hallucination rate (sample-audited) | > 5% |
Not all of these are easy to measure automatically. The hallucination rate especially needs human spot-checks. Build a weekly review where someone reads 20 random conversations and rates them.
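A sketch of pulling that weekly sample, assuming calls are logged to a Django model (AICall, input_text, and output_text are hypothetical names):

```python
# evals/weekly_sample.py
from myapp.models import AICall  # hypothetical: one row per logged model call

def weekly_sample(n=20):
    # order_by("?") is Django's random ordering; fine at this sample size
    for call in AICall.objects.order_by("?")[:n]:
        print("--- input ---")
        print(call.input_text)
        print("--- output ---")
        print(call.output_text)
        print()
```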
Before launch, write down:
Have answers before you launch, not during the incident.
Common causes of failed AI features:
Each of these is preventable with the practices above.
The roadmap from PoC to production is not about writing better prompts. It's about the architecture, evals, monitoring, and process disciplines laid out above.
Most demos can be built in a weekend. Most production AI features take 4–8 weeks. The difference is this checklist. Skip it and you'll either ship something flaky or burn money silently.
Take it seriously and you'll ship features that earn user trust and survive scale. Which is the whole point.