Three pillars — logs, metrics, traces — wired into a real Django app. Sentry for errors, structlog for context, django-prometheus for golden signals, OpenTelemetry for distributed traces. The minimum to debug prod without ssh-ing in.
If your only debugging tool in production is ssh + tail -f, you're flying blind. Modern observability is three signals — logs, metrics, traces — instrumented at the boundaries of your code. This is the smallest setup that gives you actionable visibility into a Django app.
pip install "sentry-sdk[django,celery]"
# settings.py
import os

import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    integrations=[DjangoIntegration(), CeleryIntegration()],
    traces_sample_rate=0.05,  # 5% of transactions for performance
    profiles_sample_rate=0.05,
    send_default_pii=False,  # IMPORTANT: never send raw form data
    environment=os.environ.get("ENV", "production"),
    release=os.environ.get("GIT_SHA", "unknown"),
)
Tag releases with the git SHA so you can see "this regression appeared in abc123." Sentry de-dupes by stack trace, so a flood of one error becomes one issue with a count, not 10,000 alerts.
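Unhandled exceptions reach Sentry automatically; for errors you catch and handle, you can still report them with searchable context. A minimal sketch, assuming the init above (charge_card, the order dict, and the tag name are illustrative, not part of the SDK):
import sentry_sdk

def charge_card(order):  # hypothetical payment call, stands in for your own code
    raise TimeoutError("payment gateway timed out")

try:
    charge_card(order={"id": 42})
except TimeoutError as e:
    sentry_sdk.set_tag("payment_method", "card")  # tags are filterable in the Sentry UI
    sentry_sdk.capture_exception(e)  # reported with stack trace, release, and environment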
String-formatted logs are useless at scale. Structured logs are JSON with stable keys you can grep, filter, and aggregate.
pip install structlog
# settings.py
import logging

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)
# In a middleware: bind request_id and user_id once, get them on every log line
import uuid

import structlog

def request_context_middleware(get_response):
    def middleware(request):
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=str(uuid.uuid4()),
            user_id=getattr(request.user, "id", None),
            path=request.path,
        )
        return get_response(request)
    return middleware
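Register it in settings, after AuthenticationMiddleware so request.user is populated (the module path below is hypothetical; use whatever file you put it in):
MIDDLEWARE += ["yourapp.middleware.request_context_middleware"]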
Now log.info("order_paid", order_id=o.id, amount=o.total) emits a JSON line with all the request context attached. Ship to Loki, Elasticsearch, or CloudWatch — they all index it.
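End to end, that looks roughly like this; a sketch assuming the configuration and middleware above, with made-up values:
import structlog

log = structlog.get_logger()
log.info("order_paid", order_id=42, amount="19.99")
# one JSON line, with the contextvars bound by the middleware merged in (values illustrative):
# {"request_id": "b2e4…", "user_id": 7, "path": "/checkout/", "order_id": 42,
#  "amount": "19.99", "event": "order_paid", "level": "info", "timestamp": "2024-05-01T12:00:00Z"}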
pip install django-prometheus
# settings.py
INSTALLED_APPS = ["django_prometheus", *INSTALLED_APPS]
MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    *MIDDLEWARE,
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# urls.py
from django.urls import include, path

urlpatterns += [path("", include("django_prometheus.urls"))]  # exposes /metrics
Out of the box you get request rate, latency histograms, DB query counts, cache hits, and migration status. Scrape /metrics from Prometheus every 15s. The "golden signals" to alert on:
django_http_requests_latency_seconds (latency)
django_http_responses_total (traffic and error rate)
Lock down /metrics to internal IPs only. Do not expose it publicly.
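The lockdown usually happens at the load balancer or ingress; if you also want a belt-and-braces check in the app, here is a sketch of a small middleware (the allowlisted networks are illustrative):
from ipaddress import ip_address, ip_network

from django.http import HttpResponseForbidden

INTERNAL_NETS = [ip_network("10.0.0.0/8"), ip_network("127.0.0.0/8")]  # adjust to your network

def metrics_allowlist_middleware(get_response):
    def middleware(request):
        if request.path.startswith("/metrics"):
            client = ip_address(request.META.get("REMOTE_ADDR", "0.0.0.0"))
            if not any(client in net for net in INTERNAL_NETS):
                return HttpResponseForbidden()
        return get_response(request)
    return middleware

For business-level metrics, define your own counters and histograms with prometheus_client; they land in the default registry and show up on the same /metrics endpoint: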
from prometheus_client import Counter, Histogram

orders_total = Counter("orders_total", "Orders created", ["payment_method"])
order_latency = Histogram("order_create_seconds", "Order creation time")

@order_latency.time()
def create_order(...):
    ...
    orders_total.labels(payment_method="card").inc()
Logs and metrics tell you "something is slow." Traces tell you where. OpenTelemetry is the vendor-neutral standard.
pip install opentelemetry-distro opentelemetry-exporter-otlp \
opentelemetry-instrumentation-django opentelemetry-instrumentation-psycopg2 \
opentelemetry-instrumentation-redis opentelemetry-instrumentation-requests
opentelemetry-bootstrap -a install
Run gunicorn under opentelemetry-instrument:
OTEL_SERVICE_NAME=djzen \
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.honeycomb.io \
OTEL_TRACES_SAMPLER=traceidratio OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument gunicorn djzen.wsgi
Now every request shows: HTTP receive → middleware → view → ORM queries → Redis → external HTTP → render → response. Click the slow span to see the SQL. Click the SQL to see the EXPLAIN. This is how you find the unexpected N+1 in production code.
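Auto-instrumentation covers the framework, database driver, Redis, and outbound HTTP calls; for your own hot paths, add manual spans with the OpenTelemetry API. A sketch (the span and attribute names are illustrative):
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def build_invoice(order_id):
    # shows up as its own span nested under the request's server span
    with tracer.start_as_current_span("invoice.build") as span:
        span.set_attribute("order.id", order_id)
        ...  # the expensive work you want to see separately in the trace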
Page-worthy alerts (the ones that wake someone at 3am) should map straight to the golden signals above: error rate and latency past your thresholds. Everything else is a Slack notification, not a page. Alert fatigue is the silent killer of on-call.
Traces and structured logs are expensive at scale. Sample at 1–10% in normal traffic, but always keep 100% of slow/errored requests:
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.05 # 5% baseline
# always-on for errors via tail-based sampler in your collector
Sentry for "what broke." Structured logs for "what was happening at the time." Prometheus for "is it healthy now?" OpenTelemetry for "where's the latency?" Set this up before you need it — under a real outage, you'll learn nothing without it.