How to ship schema changes without locking your production database. NOT NULL adds, column renames, index creation, and the multi-deploy patterns that separate Django teams who break prod from those who don't.
A migration that takes 30 seconds in dev can lock your prod table for 10 minutes and take down checkout. Most Django outages I've seen weren't bad code — they were a single ALTER TABLE in a migration that looked harmless. This guide is the rules I follow before I let migrations near a 50-million-row table.
PostgreSQL takes one of several lock levels for DDL. The dangerous one is ACCESS EXCLUSIVE: nothing reads, nothing writes. Operations that take it on a large table:
- Adding NOT NULL to an existing column.
- Changing a column's type (e.g. VARCHAR(50) → VARCHAR(100)).
- Creating an index without CONCURRENTLY.

The fix: never let a migration both change schema and scan/rewrite the table in the same step.
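One guardrail worth pairing with any DDL (a habit of mine, not something Django gives you): set a lock_timeout, so a migration that can't get its lock fails fast instead of queueing behind a long-running transaction while every query that arrives after it piles up. A sketch as a Django migration; the timeout and the DDL slot are placeholders. This relies on the migration being atomic, which is Django's default on PostgreSQL, so SET LOCAL has a transaction to scope to:

```python
from django.db import migrations


class Migration(migrations.Migration):
    operations = [
        # SET LOCAL scopes the timeout to this migration's transaction,
        # so it doesn't leak into the rest of the session.
        migrations.RunSQL("SET LOCAL lock_timeout = '5s';"),
        # ...the actual DDL goes here; if it can't take its lock within 5s,
        # it aborts cleanly and you retry off-peak instead of blocking prod.
    ]
```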
The naive way:
```python
# BAD: on PostgreSQL < 11 this rewrites the table under ACCESS EXCLUSIVE, and
# Django drops the DB default afterwards, so old code's INSERTs then violate NOT NULL
from django.db import migrations, models

class Migration(migrations.Migration):
    operations = [
        migrations.AddField("Order", "country",
                            models.CharField(max_length=2, default="NL")),
    ]
```
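Before shipping anything like this, print the exact SQL Django will run; the statements that need ACCESS EXCLUSIVE tend to jump out. The app label and migration name below are placeholders:

```python
# Equivalent to: python manage.py sqlmigrate shop 0043_order_country
from django.core.management import call_command

call_command("sqlmigrate", "shop", "0043_order_country")
```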
The safe way is three deploys.

Deploy 1: add the column, nullable.

```python
# Inside the new migration's operations list:
migrations.AddField("Order", "country",
                    models.CharField(max_length=2, null=True, blank=True))
```
Adding a nullable column with no default is a metadata-only change, so this is effectively instant (PostgreSQL 11+ extends the fast path to columns with static defaults too). The app still works; the column is simply unused.

Deploy 2: backfill in batches, outside the migration system.

```python
# Run as a one-off management command, NOT a migration
# (assumes Order is imported from your app's models)
import time

qs = Order.objects.filter(country__isnull=True)  # lazy, re-evaluated each pass
batch = 5000
while qs.exists():
    ids = list(qs.values_list("id", flat=True)[:batch])
    Order.objects.filter(id__in=ids).update(country="NL")
    time.sleep(0.1)  # let replicas catch up
```
And update app code to write country on every new Order (so the gap closes from both ends).

Deploy 3: enforce NOT NULL without the long exclusive scan.

```sql
-- PostgreSQL 12+
ALTER TABLE orders ADD CONSTRAINT orders_country_notnull CHECK (country IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_country_notnull;
ALTER TABLE orders ALTER COLUMN country SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_country_notnull;
```
NOT VALID adds the constraint with only a brief exclusive lock; VALIDATE then scans the table under a weak SHARE UPDATE EXCLUSIVE lock, so reads and writes keep flowing. On PostgreSQL 12+ the final SET NOT NULL sees the already-validated CHECK, skips its own full-table scan, and is near-instant.

Yes, three deploys. Yes, it's worth it. Your alternative is being on a Sunday-night call.
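If you'd rather keep deploy 3 inside the migration framework than run the SQL by hand, here's a sketch (same table and constraint names as above; "shop" is an assumed app label). The key detail is atomic = False: if VALIDATE ran in the same transaction as the NOT VALID add, the brief exclusive lock from the add would be held through the whole validation scan, defeating the point.

```python
from django.db import migrations


class Migration(migrations.Migration):
    atomic = False  # let each statement commit on its own

    operations = [
        migrations.RunSQL(
            "ALTER TABLE orders ADD CONSTRAINT orders_country_notnull "
            "CHECK (country IS NOT NULL) NOT VALID;",
            reverse_sql="ALTER TABLE orders DROP CONSTRAINT orders_country_notnull;",
        ),
        # Long scan, but under a weak lock: writes keep flowing.
        migrations.RunSQL(
            "ALTER TABLE orders VALIDATE CONSTRAINT orders_country_notnull;",
            reverse_sql=migrations.RunSQL.noop,
        ),
        migrations.RunSQL(
            "ALTER TABLE orders ALTER COLUMN country SET NOT NULL; "
            "ALTER TABLE orders DROP CONSTRAINT orders_country_notnull;",
            reverse_sql="ALTER TABLE orders ALTER COLUMN country DROP NOT NULL;",
        ),
    ]
```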
RENAME COLUMN is metadata-only and fast. The problem isn't PostgreSQL — it's app code. Between deploy and rollback you'd have running web/worker processes referring to the old name. Pattern:
1. Add email_address (nullable).
2. Deploy code that writes both email and email_address.
3. Backfill email_address from email.
4. Deploy code that reads and writes only email_address.
5. Confirm nothing still references email.
6. Drop email.

Six steps. Two are deploys; the others are migrations or background jobs. If anyone tells you they can rename a hot column in one deploy on a 100M-row table, ask to see the post-mortem. The dual-write window is sketched below.
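A minimal sketch of what that dual-write window (steps 2 through 4) can look like; the User model here is hypothetical:

```python
from django.db import models


class User(models.Model):
    # Old name: still read by code from the previous deploy.
    email = models.EmailField()
    # New name: nullable while the step-3 backfill runs.
    email_address = models.EmailField(null=True, blank=True)

    def save(self, *args, **kwargs):
        # Dual-write (step 2): the old column stays canonical and the new one
        # mirrors it, so rows written during the window need no second backfill.
        self.email_address = self.email
        super().save(*args, **kwargs)
```

After the step-4 deploy the direction flips: email_address becomes canonical, and if email still carries a NOT NULL constraint, keep mirroring into it (or relax the constraint) until step 6 drops it for good.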
Index creation gets its own rule: always CONCURRENTLY, and always in a non-atomic migration.

```python
from django.db import migrations


class Migration(migrations.Migration):
    atomic = False  # CRITICAL — CONCURRENTLY can't run in a transaction

    operations = [
        migrations.RunSQL(
            "CREATE INDEX CONCURRENTLY orders_email_idx ON orders (email);",
            reverse_sql="DROP INDEX CONCURRENTLY orders_email_idx;",
        ),
    ]
```
CONCURRENTLY doesn't take a write lock — but it can fail half-built. If it does, you'll see INVALID in \d; drop it and retry.
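If you'd rather not eyeball \d output, invalid indexes are queryable: a failed concurrent build leaves pg_index.indisvalid set to false. A small sketch using Django's connection API:

```python
from django.db import connection


def invalid_indexes():
    """Names of indexes left INVALID by a failed CREATE INDEX CONCURRENTLY."""
    with connection.cursor() as cur:
        cur.execute(
            "SELECT c.relname "
            "FROM pg_class c JOIN pg_index i ON i.indexrelid = c.oid "
            "WHERE NOT i.indisvalid;"
        )
        return [name for (name,) in cur.fetchall()]
```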
RunPython in a migration runs during the deploy, inside the migration's transaction. If your data step touches 10 million rows, the deploy blocks for an hour and the worker pool fills up. Rule: RunPython is for tiny, bounded fixups (a few hundred rows max). Anything larger goes to a management command run separately, with batching and sleeps.
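For the acceptable end of that rule, a sketch of a tiny, bounded RunPython fixup (app label, model, and values are hypothetical); note the historical-model lookup via apps.get_model and the explicit no-op reverse:

```python
from django.db import migrations


def fix_legacy_country_codes(apps, schema_editor):
    # Historical model snapshot, not the live class from models.py.
    Order = apps.get_model("shop", "Order")
    # Bounded: a known handful of bad rows, safe to run inside the deploy.
    Order.objects.filter(country="UK").update(country="GB")


class Migration(migrations.Migration):
    dependencies = [("shop", "0044_previous")]  # placeholder

    operations = [
        migrations.RunPython(fix_legacy_country_codes, migrations.RunPython.noop),
    ]
```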
Frame every schema change as: "What does this look like at any moment during the deploy, when half the workers are on old code and half on new?" If the answer involves "old code panics because country is NOT NULL but it doesn't know that field exists" — your migration is unsafe.
Zero-downtime is a discipline, not a tool. Add columns nullable, backfill in chunks, gate on NOT VALID constraints, never rename in one deploy, always create indexes CONCURRENTLY, and keep RunPython for tiny things. Read every migration with the question: "What if this took 10 minutes?" If the answer is "outage," redesign the migration.