Observability for Django: Sentry, Structured Logging, Prometheus, and OpenTelemetry Tracing

Four pillars (errors, logs, metrics, traces) wired into a real Django app. Sentry for errors, structlog for context, django-prometheus for golden signals, OpenTelemetry for distributed traces. The minimum to debug prod without ssh-ing in.

DjangoZen Team, Apr 25, 2026

When production breaks at 2am, the question is not whether you have logs — it is whether you can answer "what is happening, right now, and why" in minutes instead of hours. Observability is the practice that makes a running system explainable: errors you see immediately, logs you can actually query, metrics that reveal trends, and traces that follow one request across every service it touches. This tutorial builds all four pillars for Django: Sentry, structured logging, Prometheus metrics, and OpenTelemetry tracing.

Monitoring versus observability

Monitoring tells you that something is wrong — a dashboard goes red, an alert fires. Observability tells you why — it is the property of a system that lets you ask arbitrary questions about its behavior from the outside, without shipping new code to investigate. The distinction matters because modern systems fail in ways you did not predict, and a fixed set of dashboards cannot anticipate every question. Good observability means that when a novel problem appears, you already have the data to explore it. The four pillars below each answer a different kind of question, and together they turn a production incident from a guessing game into an investigation with evidence.

Error tracking with Sentry

The first and highest-value pillar is error tracking, and Sentry is a widely used choice for it. Without error tracking, errors hide in log files nobody reads until a user complains; with it, every exception is captured with its full stack trace, the request that caused it, and the surrounding context, then grouped so a thousand occurrences of one bug become a single actionable issue.

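import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

# Call once at startup, typically from settings.py.
sentry_sdk.init(
    dsn="https://...@sentry.io/...",     # project DSN; load the real value from the environment
    integrations=[DjangoIntegration()],
    traces_sample_rate=0.1,              # send performance traces for roughly 10% of requests
    send_default_pii=False,              # do not attach user details such as IPs or emails by default
)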

Setup is minutes, and the payoff is immediate: you learn about errors before users report them, with everything you need to reproduce and fix them already attached.

Making error tracking useful

Error tracking degrades fast without hygiene. A flood of noisy, low-value errors trains the team to ignore the tool, so the discipline is to keep the signal high: filter out expected exceptions, resolve or mute issues deliberately, and attach context — the user, the tenant, the relevant IDs — so each error is actionable rather than a bare stack trace. Set up alerting for new or spiking issues so a fresh bug reaches you fast, but tune it so routine noise does not. An error tracker that everyone trusts because its alerts always mean something is worth far more than one that cries wolf and gets muted.
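
One way to filter out expected exceptions is a `before_send` hook, which the Sentry SDK calls with the event and a hint before sending; returning `None` drops the event. The sketch below is a minimal illustration under assumptions: `OutOfStock` is a hypothetical business exception, and the list of expected types is something each team would choose for itself.

```python
class OutOfStock(Exception):
    """Hypothetical business exception that is expected in normal operation."""


# Exception types we assume are routine noise rather than bugs.
EXPECTED_EXCEPTIONS = (OutOfStock, ConnectionResetError)


def drop_expected(event, hint):
    """Return None to drop an event, or the event itself to send it."""
    exc_info = hint.get("exc_info")
    if exc_info is not None:
        exc_type = exc_info[0]
        if exc_type is not None and issubclass(exc_type, EXPECTED_EXCEPTIONS):
            return None  # expected failure: keep it out of the issue list
    return event
```

It would be wired in by passing `before_send=drop_expected` to `sentry_sdk.init`. Keeping the filter as a plain function makes it easy to unit test without a Sentry connection.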

Structured logging

Traditional logs are strings written for humans, which makes them nearly impossible to query at scale — you cannot reliably filter millions of free-text lines by user or request. Structured logging emits logs as machine-readable key-value data (typically JSON), so every log carries queryable fields:

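import structlog
log = structlog.get_logger()
# order, user, and tenant are assumed to come from the surrounding view code.
# JSON output additionally needs structlog.processors.JSONRenderer() in the processor chain.
log.info("order_placed", order_id=order.id, user_id=user.id,
         total=str(order.total), tenant_id=tenant.id)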

Now you can ask your log system "show me all order_placed events for tenant 7 over $500 today" and get an answer. The shift from logs-as-prose to logs-as-data is what makes logs a genuine investigative tool rather than a wall of text you grep in desperation.
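
To make the "logs as data" point concrete, here is a small sketch of that same question answered over JSON log lines in plain Python. The sample lines and the `large_orders` helper are made up for illustration; a real log system would run the equivalent query in its own query language.

```python
import json
from decimal import Decimal

# Hypothetical JSON log lines, one event per line, as a JSON renderer might emit them.
LOG_LINES = [
    '{"event": "order_placed", "order_id": 101, "user_id": 9, "total": "742.10", "tenant_id": 7}',
    '{"event": "order_placed", "order_id": 102, "user_id": 4, "total": "19.99", "tenant_id": 7}',
    '{"event": "payment_failed", "order_id": 103, "user_id": 4, "tenant_id": 7}',
    '{"event": "order_placed", "order_id": 104, "user_id": 2, "total": "910.00", "tenant_id": 3}',
]


def large_orders(lines, tenant_id, minimum):
    """Return order IDs of order_placed events for one tenant above a total."""
    matches = []
    for line in lines:
        record = json.loads(line)
        if (record.get("event") == "order_placed"
                and record.get("tenant_id") == tenant_id
                and Decimal(record["total"]) > minimum):
            matches.append(record["order_id"])
    return matches
```

Because every field is a key rather than a fragment of a sentence, the filter is three exact comparisons instead of a fragile regular expression.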

Contextual logging and correlation

Structured logs become powerful when every log line from a single request shares a correlation ID, so you can reconstruct the full story of one request across many log statements. Bind request-scoped context — a request ID, the user, the tenant — once, and have it automatically attached to every subsequent log in that request. This means when you find one interesting log line, you can pull every other line from the same request instantly. In a distributed system, propagating that correlation ID across service boundaries lets you follow one user action through every service's logs. Correlation is what turns scattered log lines into a coherent narrative of what happened.
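
The binding step can be sketched with the standard library alone, using a context variable and a logging filter. This is an illustration of the mechanism rather than the article's exact setup: `request_id_var`, `RequestIdFilter`, and `handle_request` are hypothetical names, and with structlog the equivalent would typically go through its contextvars helpers.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for whatever request is currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def handle_request(handler, incoming_id=None):
    """Bind a correlation ID for the duration of one request, then restore."""
    # Reuse an ID passed in by an upstream service, or mint a new one.
    token = request_id_var.set(incoming_id or uuid.uuid4().hex)
    try:
        return handler()
    finally:
        request_id_var.reset(token)
```

In a Django project the body of `handle_request` would live in a middleware, and a formatter containing `%(request_id)s` would print the ID on every line.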

Log levels and what to log

Logging everything is as useless as logging nothing — the signal drowns. Use levels deliberately: DEBUG for development detail, INFO for significant business events, WARNING for recoverable problems worth noticing, ERROR for failures. Log the events that tell the story of your system's behavior — a placed order, a failed payment, a permission denial — not every trivial step. And never log secrets, tokens, passwords, or sensitive personal data, because logs are widely accessible and long-lived; a logged credential is a leaked credential. Thoughtful logging is curation: enough to explain the system, not so much that the important events are buried under noise.
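
As a safety net for the "never log secrets" rule, a redaction step can scrub well-known keys before a log line is rendered. The sketch below uses the `(logger, method_name, event_dict)` shape that structlog processors take; the key list is an assumption and would need to match the fields your own code actually uses.

```python
# Field names we assume may carry credentials; extend to suit the codebase.
SENSITIVE_KEYS = {"password", "token", "authorization", "secret", "api_key"}


def redact_sensitive(logger, method_name, event_dict):
    """Replace the values of sensitive keys with a fixed placeholder."""
    for key in event_dict:
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
    return event_dict
```

A filter like this only catches secrets stored under predictable keys, so it complements, rather than replaces, the habit of not passing them to the logger in the first place.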

Metrics with Prometheus

Metrics are numeric measurements over time — request rate, error rate, latency, queue depth, active users — and they reveal trends and aggregate health that individual logs and errors cannot. Prometheus is the standard: your app exposes a metrics endpoint, Prometheus scrapes it periodically, and you query and graph the results.

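from prometheus_client import Counter, Histogram
# Counter: a total that only increases, split by a low-cardinality label.
orders = Counter("orders_total", "Orders placed", ["status"])
# Histogram: sorts observations into buckets so latency percentiles can be estimated.
latency = Histogram("request_latency_seconds", "Request latency", ["view"])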

Counters only go up (total orders, total errors), gauges move both ways (current connections), and histograms capture distributions (latency percentiles). Metrics are how you see the shape of your system's behavior over time rather than one moment of it.
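
To show how a histogram captures a distribution, here is a toy version of the idea in plain Python. Prometheus histograms expose cumulative buckets keyed by an upper bound; the `BucketHistogram` class below is a hypothetical stand-in for that behavior, not the client library's implementation.

```python
import bisect


class BucketHistogram:
    """Toy histogram with cumulative buckets keyed by inclusive upper bound."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # Index of the first bound that is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative(self):
        """Return {upper_bound: observations <= upper_bound}."""
        running = 0
        result = {}
        for bound, count in zip(self.upper_bounds + [float("inf")], self.counts):
            running += count
            result[bound] = running
        return result
```

Only the per-bucket counts are stored, not each observation, which is why histograms stay cheap at high request rates and why percentile estimates are only as precise as the bucket boundaries.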

The golden signals

With metrics you could measure anything, so it helps to start with the signals that matter most. The "four golden signals" are latency (how long requests take), traffic (how many requests), errors (how many fail), and saturation (how full your resources are). Tracked together, these four give a remarkably complete picture of service health and are usually enough to detect most problems. Look at latency at the percentiles that matter (p95 and p99, not just the average), because tail percentiles reveal pain that averages hide, and track errors as a ratio of failed to total requests. Beginning your metrics with the golden signals, then adding business-specific metrics like orders or signups, gives you broad coverage without drowning in dashboards.
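
A quick worked example shows why the average hides tail latency. The numbers are invented for illustration, and `percentile` is a simple nearest-rank helper rather than what a metrics backend computes from histogram buckets.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling division in integers avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 90 fast requests, 8 slow ones, 2 very slow ones (ms).
latencies_ms = [80] * 90 + [400] * 8 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

Here the mean comes out at 204 ms and the median at 80 ms, both of which look healthy, while p99 is 5000 ms: one request in a hundred takes five seconds, and only the tail percentile says so.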

Dashboards and alerting

Metrics are only useful if someone sees them, and the two ways are dashboards and alerts. Dashboards (typically Grafana over Prometheus) visualize trends for at-a-glance health and investigation. Alerts watch metrics and notify you when something crosses a threshold — error rate spiking, latency climbing, a queue backing up — so you respond before users feel it. The art of alerting is the same as error hygiene: alert on symptoms users care about, make every alert actionable, and avoid noise that trains people to ignore the pager. A well-tuned alert that means "users are being hurt right now" is the payoff of all the metrics work.

Distributed tracing with OpenTelemetry

In a system of multiple services, a single user request fans out across many of them, and the hardest question is "where did the time go?" Distributed tracing answers it. A trace follows one request end to end, recording each step — each service call, each database query — as a timed span, assembled into a tree that shows exactly where the request spent its time. OpenTelemetry is the vendor-neutral standard for producing this data, with instrumentation for Django that captures requests and queries automatically. Tracing turns "the request was slow" into "it spent 800ms in the recommendations service waiting on one query," which is the difference between guessing and knowing.

Spans, context, and propagation

A trace is built from spans, each representing a unit of work with a start, end, and metadata, nested to show causality. The key mechanism is context propagation: a trace ID is passed from service to service (in request headers), so every span across every service is linked into one trace. This is what lets you see a single user action stitched together across your whole architecture, even as it crosses network boundaries. Add custom spans around your own expensive operations to make them visible in the trace. Understanding spans and propagation is what lets tracing reveal the true, end-to-end path of a request rather than isolated fragments.
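
Propagation over HTTP commonly uses the W3C `traceparent` header, whose version-00 layout is `00-<trace id>-<parent span id>-<flags>`. The following sketch shows what a service does with it in principle: keep the trace ID, mint a new span ID, and forward the result. `child_traceparent` is a hypothetical helper that skips some validation the specification requires, and OpenTelemetry's instrumentation normally handles this automatically.

```python
import re
import secrets

# Version 00 traceparent: 32 hex chars of trace ID, 16 of span ID, 2 of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming):
    """Build the header to send downstream, continuing the trace if one exists."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match:
        trace_id, _parent_span_id, flags = match.groups()
    else:
        # No valid upstream context: this service starts a new trace.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # identifies this service's own span
    return f"00-{trace_id}-{span_id}-{flags}"
```

The trace ID is the only part that survives every hop, which is why spans from every service can later be joined into one trace.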

Sampling and cost

Tracing and detailed telemetry generate enormous data volume, and capturing every trace is expensive and usually unnecessary. Sampling keeps a representative fraction — perhaps ten percent of normal traffic — while ensuring you capture the interesting cases like errors and slow requests, which is where tail-based sampling helps by deciding what to keep after seeing the whole trace. The goal is enough data to diagnose problems without paying to store every routine request. Balancing observability depth against cost is a real operational concern; sample the mundane, keep the anomalous, and you get the insight you need at a price you can sustain.
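
"Sample the mundane, keep the anomalous" can be written as a small decision function applied once a trace is complete. This is a sketch under assumptions: the 10% rate and the one-second slow threshold are illustrative defaults, and `keep_trace` is a hypothetical helper, not part of any tracing SDK.

```python
def keep_trace(trace_id_hex, had_error, duration_ms, sample_rate=0.10, slow_ms=1000):
    """Decide whether to store a finished trace."""
    # Anomalies are always kept, whatever the sampling rate.
    if had_error or duration_ms >= slow_ms:
        return True
    # Map the low 32 bits of the trace ID onto [0, 1) so that every
    # service reaches the same decision for the same trace.
    bucket = int(trace_id_hex[-8:], 16) / 0x100000000
    return bucket < sample_rate
```

Deriving the decision from the trace ID instead of a random number keeps traces whole: either every span of a routine request is kept or none is.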

Tying the pillars together

The four pillars are most powerful when connected. An error in Sentry links to the trace that produced it; that trace links to the structured logs from the same request via a shared correlation ID; and metrics show whether this error is an isolated event or part of a spike. This connected workflow is the goal: you notice a problem from a metric or alert, jump to the error for the stack trace, follow the trace to see where time and failure occurred, and read the correlated logs for the detail. A correlation or trace ID threaded through all four is the glue that turns separate tools into a single investigative surface.

SLIs, SLOs, and error budgets

Observability data becomes a management tool through service-level objectives. A service-level indicator (SLI) is a measured signal, say the percentage of requests served under 300ms; a service-level objective (SLO) is the target for it, such as 99.9% over a month; and the error budget is the allowed shortfall, here the 0.1% of requests that may miss the target. This framework turns vague aspirations about reliability into concrete numbers everyone agrees on, and the error budget gives a principled way to balance shipping features against stability: when you are within budget you can move fast, and when you are burning it you slow down and shore up reliability. Grounding reliability in measured objectives rather than opinion is one of the most valuable things observability enables.
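
The error budget arithmetic is short enough to write down. The helper below is an illustrative sketch with invented traffic numbers; it takes the target as a string so that it can be parsed exactly rather than as a binary float.

```python
from fractions import Fraction


def error_budget(slo_target, total_requests, bad_requests):
    """Summarize an error budget for a request-based SLO."""
    target = Fraction(slo_target)                 # "0.999" is parsed exactly
    allowed_bad = total_requests * (1 - target)   # requests allowed to miss the target
    return {
        "allowed_bad": allowed_bad,
        "remaining": allowed_bad - bad_requests,
        "consumed": Fraction(bad_requests) / allowed_bad if allowed_bad else None,
    }


# Hypothetical month: 10 million requests, 2,500 of them slower than 300ms.
budget = error_budget("0.999", 10_000_000, 2_500)
```

With these numbers the budget is 10,000 slow requests, of which a quarter is spent. The same 99.9% target expressed as availability over a 30-day month allows 43.2 minutes of downtime.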

The cardinality trap in metrics

A subtle but expensive mistake with metrics is high cardinality — attaching labels with many possible values, like a user ID or a request ID, to a metric. Each unique combination of label values creates a separate time series, and labels with thousands or millions of values explode the number of series, overwhelming your metrics system and ballooning cost. Labels should be low-cardinality dimensions you slice by — status code, endpoint, region — not unbounded identifiers, which belong in logs and traces instead. Understanding the cardinality trap is essential to using metrics sustainably; it is the difference between a metrics system that stays fast and affordable and one that collapses under its own series count.
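
The explosion is easy to quantify, because the worst-case series count for one metric is the product of its label cardinalities. The label counts below are invented for illustration.

```python
import math


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of label value counts."""
    return math.prod(label_cardinalities.values())


# Hypothetical low-cardinality labels on a request counter.
safe_labels = {"status": 5, "endpoint": 40, "region": 4}

# The same metric with a user_id label added, assuming 100,000 users.
risky_labels = {**safe_labels, "user_id": 100_000}

safe_series = series_count(safe_labels)
risky_series = series_count(risky_labels)
```

Three bounded labels give at most 800 series; adding the user ID raises the ceiling to 80 million for the same metric, and a histogram multiplies that again by its bucket count.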

Avoiding alert fatigue

The failure mode of alerting is not too few alerts but too many: a flood of noisy, non-actionable alerts trains responders to ignore the pager, so the one alert that matters gets missed among the noise. Good alerting is ruthless about signal — alert on symptoms users actually feel, make every alert require a human action, and route informational signals to dashboards rather than the pager. Tuning alerts so each one means "something is wrong that needs you now" preserves the trust that makes alerting work. An alerting system everyone trusts because it never cries wolf is far more valuable than a comprehensive one that everyone has muted.

Observing the user's experience

Server-side observability tells you what your backend did, but the user experiences the whole journey including the browser, the network, and rendering. Real user monitoring captures performance from the user's side — how long pages actually took to load and become interactive for real visitors on real devices and connections. This often reveals that a backend you measure as fast is slow in users' hands due to network latency, heavy assets, or client rendering. Complementing backend observability with a view of the actual user experience closes the gap between "our servers responded quickly" and "our users had a fast experience," which are not the same thing and sometimes diverge sharply.

Summary

Observability is what makes a running system explainable when it matters most, and it rests on four pillars. Sentry captures errors with the full context to fix them, kept useful through hygiene and tuned alerting. Structured logging turns logs from un-queryable prose into machine-readable data, made coherent by correlation IDs that reconstruct each request. Prometheus metrics reveal trends and aggregate health — start with the golden signals of latency, traffic, errors, and saturation — surfaced through dashboards and actionable alerts. OpenTelemetry tracing follows one request across every service via context propagation, showing exactly where time and failures occur, sampled to control cost. Connect all four with a shared trace or correlation ID, and an incident becomes an investigation with evidence rather than a frantic guess. Invest in observability before you need it, because the time to instrument your system is not while it is on fire.