DevOps Advanced

Observability for Django: Sentry, Structured Logging, Prometheus, and OpenTelemetry Tracing

Three pillars — logs, metrics, traces — wired into a real Django app. Sentry for errors, structlog for context, django-prometheus for golden signals, OpenTelemetry for distributed traces. The minimum to debug prod without ssh-ing in.

DjangoZen Team Apr 25, 2026 20 min read

If your only debugging tool in production is ssh + tail -f, you're flying blind. Modern observability is three signals — logs, metrics, traces — instrumented at the boundaries of your code. This is the smallest setup that gives you actionable visibility into a Django app.

The three pillars in one sentence each

  • Logs: a stream of events with context. Answer "what happened?"
  • Metrics: numeric time-series, aggregated. Answer "is it healthy right now?"
  • Traces: follow a single request across services. Answer "where did the latency go?"

Sentry — error tracking, ten lines

pip install "sentry-sdk[django,celery]"
# settings.py
import os

import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration
from sentry_sdk.integrations.celery import CeleryIntegration

sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    integrations=[DjangoIntegration(), CeleryIntegration()],
    traces_sample_rate=0.05,    # 5% of transactions for performance
    profiles_sample_rate=0.05,
    send_default_pii=False,     # IMPORTANT — never send raw form data
    environment=os.environ.get("ENV", "production"),
    release=os.environ.get("GIT_SHA", "unknown"),
)

Tag releases with the git SHA so you can see "this regression appeared in abc123." Sentry de-dupes by stack trace, so a flood of one error becomes one issue with a count, not 10,000 alerts.
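Even with send_default_pii=False, request bodies and headers can still reach Sentry through some integrations. The before_send option of sentry_sdk.init lets you scrub the event dict before it leaves the process. A minimal sketch, pass it as sentry_sdk.init(..., before_send=scrub_event); the event shape follows Sentry's event schema, and the exact header names scrubbed here are illustrative:

```python
# Scrub sensitive fields from a Sentry event dict before it is sent.
def scrub_event(event, hint):
    request = event.get("request", {})
    request.pop("cookies", None)   # session cookies
    request.pop("data", None)      # raw form/body data
    headers = request.get("headers", {})
    for key in ("Authorization", "Cookie", "X-Api-Key"):
        headers.pop(key, None)
    return event  # return None instead to drop the event entirely


# Demo with a hand-built event dict:
event = {
    "request": {
        "url": "https://djzen.example/orders/",
        "cookies": "sessionid=abc",
        "headers": {"Authorization": "Bearer t0k3n", "Accept": "text/html"},
    }
}
clean = scrub_event(event, hint=None)
print("cookies" in clean["request"])  # False
print(clean["request"]["headers"])    # {'Accept': 'text/html'}
```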

Structured logs with structlog

String-formatted logs are useless at scale. Structured logs are JSON with stable keys you can grep, filter, and aggregate.

pip install structlog
# settings.py
import logging

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)

# In a middleware: bind request_id and user_id once, get them on every log
import uuid
def request_context_middleware(get_response):
    def middleware(request):
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=str(uuid.uuid4()),
            user_id=getattr(request.user, "id", None),
            path=request.path,
        )
        return get_response(request)
    return middleware

Now, with log = structlog.get_logger() at module level, log.info("order_paid", order_id=o.id, amount=o.total) emits a JSON line with all the request context attached. Ship it to Loki, Elasticsearch, or CloudWatch — they all index it.

Prometheus — the four golden signals

pip install django-prometheus
# settings.py
INSTALLED_APPS = ["django_prometheus", *INSTALLED_APPS]
MIDDLEWARE = ["django_prometheus.middleware.PrometheusBeforeMiddleware",
              *MIDDLEWARE,
              "django_prometheus.middleware.PrometheusAfterMiddleware"]

# urls.py
urlpatterns += [path("", include("django_prometheus.urls"))]  # serves /metrics

Out of the box you get request rate, latency histograms, DB query counts, cache hits, and migration status. Scrape /metrics/ from Prometheus every 15s. The "golden signals" to alert on:

  • Latency — p95 of the django_http_requests_latency_seconds_by_view_method histogram.
  • Traffic — request rate.
  • Errors — rate of 5xx django_http_responses_total.
  • Saturation — DB connection pool usage, queue depth.

Lock down /metrics/ to internal IPs only. Do not expose it publicly.
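Usually this is a firewall or reverse-proxy rule, but if Django itself must enforce it, an allow-list check with the stdlib ipaddress module is enough. A sketch, where the CIDR ranges and the middleware wiring are assumptions for your environment:

```python
import ipaddress

# Hypothetical internal ranges; adjust to your VPC/cluster CIDRs.
INTERNAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "192.168.0.0/16", "127.0.0.0/8")
]

def is_internal(remote_addr: str) -> bool:
    """True if the client address falls inside an internal network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False
    return any(addr in net for net in INTERNAL_NETWORKS)

# In a middleware, you'd check request.META["REMOTE_ADDR"] for the
# metrics path and return HttpResponseForbidden() when this is False.
print(is_internal("10.1.2.3"), is_internal("203.0.113.7"))  # True False
```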

Custom metrics worth tracking

from prometheus_client import Counter, Histogram

orders_total  = Counter("orders_total", "Orders created", ["payment_method"])
order_latency = Histogram("order_create_seconds", "Order creation time")

@order_latency.time()
def create_order(payment_method, items):
    ...  # validate, persist, charge
    orders_total.labels(payment_method=payment_method).inc()
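Custom metrics are easy to unit-test: give them a private CollectorRegistry and read samples back with get_sample_value. (prometheus_client strips the _total suffix from counter names and re-adds it on the exported sample, so the name below still works.) A sketch:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()  # isolated registry so the test is hermetic
orders_total = Counter("orders_total", "Orders created",
                       ["payment_method"], registry=registry)
order_latency = Histogram("order_create_seconds", "Order creation time",
                          registry=registry)

@order_latency.time()
def create_order(payment_method):
    orders_total.labels(payment_method=payment_method).inc()

create_order("card")
create_order("card")
create_order("paypal")

print(registry.get_sample_value("orders_total", {"payment_method": "card"}))  # 2.0
print(registry.get_sample_value("order_create_seconds_count"))                # 3.0
```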

Distributed tracing with OpenTelemetry

Logs and metrics tell you "something is slow." Traces tell you where. OpenTelemetry is the vendor-neutral standard.

pip install opentelemetry-distro opentelemetry-exporter-otlp \
    opentelemetry-instrumentation-django opentelemetry-instrumentation-psycopg2 \
    opentelemetry-instrumentation-redis opentelemetry-instrumentation-requests
opentelemetry-bootstrap -a install

Run gunicorn under opentelemetry-instrument:

OTEL_SERVICE_NAME=djzen \
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.honeycomb.io \
OTEL_TRACES_SAMPLER=traceidratio OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument gunicorn djzen.wsgi

Now every request shows: HTTP receive → middleware → view → ORM queries → Redis → external HTTP → render → response. Click the slow span to see the SQL. Click the SQL to see the EXPLAIN. This is how you find the unexpected N+1 in production code.

Alerting — fewer is better

Page-worthy alerts (wake someone at 3am):

  • 5xx rate > 1% over 5 min.
  • p95 latency > 2× baseline for 10 min.
  • Worker queue depth growing for 15 min.
  • DB connection pool >90% saturation.

Everything else is a Slack notification, not a page. Alert fatigue is the silent killer of on-call.
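As a concrete example, the first page-worthy alert as a Prometheus rule. This is a sketch assuming django-prometheus's per-status response counter; check the metric and label names your /metrics/ endpoint actually exposes before deploying it:

```yaml
groups:
  - name: djzen
    rules:
      - alert: High5xxRate
        expr: |
          sum(rate(django_http_responses_total_by_status_total{status=~"5.."}[5m]))
            / sum(rate(django_http_responses_total_by_status_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx rate above 1% for 5 minutes"
```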

Cost-aware sampling

Traces and structured logs are expensive at scale. Sample at 1–10% in normal traffic, but always keep 100% of slow/errored requests:

OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.05  # 5% baseline
# always-on for errors via tail-based sampler in your collector

Summary

Sentry for "what broke." Structured logs for "what was happening at the time." Prometheus for "is it healthy now?" OpenTelemetry for "where's the latency?" Set this up before you need it — under a real outage, you'll learn nothing without it.