
Zero-Downtime Migrations for Django and PostgreSQL: Locks, Backfills, and the Rename Dance

How to ship schema changes without locking your production database. NOT NULL adds, column renames, index creation, and the multi-deploy patterns that separate Django teams who break prod from those who don't.

DjangoZen Team · Apr 25, 2026 · 20 min read

A migration that takes 30 seconds in dev can lock your prod table for 10 minutes and take down checkout. Most Django outages I've seen weren't bad code — they were a single ALTER TABLE in a migration that looked harmless. This guide collects the rules I follow before I let a migration near a 50-million-row table.

The thing that kills you: ACCESS EXCLUSIVE

PostgreSQL takes one of several lock levels for DDL. The dangerous one is ACCESS EXCLUSIVE: nothing reads, nothing writes while it's held. Operations that take it (or something nearly as blocking) for a long stretch on a large table:

  • Adding a column with a non-null default (pre-PG11 this rewrites the whole table under the lock; PG11+ stores the default in the catalog, so the lock is held only briefly).
  • Adding NOT NULL to an existing column (full table scan while holding the lock).
  • Changing a column type in a way that forces a rewrite (e.g., INT → BIGINT; widening a VARCHAR has been metadata-only since PG 9.2).
  • Creating an index without CONCURRENTLY (strictly a SHARE lock: reads continue, but every write blocks for the entire build).

The fix: never let a migration both change schema and scan/rewrite the table in the same step.
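
One blanket safety net before any of the patterns below: set a lock_timeout, so DDL that can't get its lock fails fast instead of queuing behind a long-running transaction (and stalling every query that then queues behind it). A minimal sketch as a Django migration; the app, table, and column names are illustrative:

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("shop", "0004_previous")]  # illustrative

    operations = [
        migrations.RunSQL(
            # SET LOCAL scopes the timeout to this migration's transaction.
            # If the ALTER can't acquire its lock within 2s, it aborts and
            # can be retried later, instead of blocking all traffic.
            "SET LOCAL lock_timeout = '2s'; "
            "ALTER TABLE shop_order ADD COLUMN country varchar(2);",
            reverse_sql="ALTER TABLE shop_order DROP COLUMN country;",
        ),
    ]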

Adding a NOT NULL column — the safe pattern

The naive way:

# BAD — on pre-PG11 this rewrites the whole table under ACCESS EXCLUSIVE
from django.db import migrations, models

class Migration(migrations.Migration):
    operations = [
        migrations.AddField("Order", "country", models.CharField(max_length=2, default="NL")),
    ]

The safe way is three deploys:

  1. Deploy 1 — add nullable column.
    migrations.AddField("Order", "country",
        models.CharField(max_length=2, null=True, blank=True))
    
    Metadata-only and effectively instant (nullable, no default, so no rewrite on any modern PG). App still works (the column is unused).
  2. Deploy 2 — backfill in chunks.
    # Run as a one-off management command, NOT a migration
    import time

    qs = Order.objects.filter(country__isnull=True)
    batch = 5000
    while qs.exists():
        ids = list(qs.values_list("id", flat=True)[:batch])
        Order.objects.filter(id__in=ids).update(country="NL")
        time.sleep(0.1)  # let replicas catch up
    
    And update app code to write country on every new Order (so the gap closes from both ends).
  3. Deploy 3 — add NOT NULL.
    -- PostgreSQL 12+
    ALTER TABLE orders ADD CONSTRAINT orders_country_notnull CHECK (country IS NOT NULL) NOT VALID;
    ALTER TABLE orders VALIDATE CONSTRAINT orders_country_notnull;
    ALTER TABLE orders ALTER COLUMN country SET NOT NULL;
    ALTER TABLE orders DROP CONSTRAINT orders_country_notnull;
    
    The ADD CONSTRAINT ... NOT VALID is instant; VALIDATE scans the table but takes only a SHARE UPDATE EXCLUSIVE lock, so reads and writes keep flowing. On PG12+, the final SET NOT NULL sees the already-validated CHECK and skips its own full-table scan, making it near-instant.

Yes, three deploys. Yes, it's worth it. Your alternative is being on a Sunday-night call.
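
If you'd rather ship deploy 3 as a migration than hand-run SQL, the same statements wrap in RunSQL. A sketch, assuming the app is called shop and the table shop_order:

from django.db import migrations

class Migration(migrations.Migration):
    # Non-atomic so each statement commits on its own; if the long VALIDATE
    # fails, it can be retried without redoing the instant steps.
    atomic = False

    dependencies = [("shop", "0006_backfill_country")]  # illustrative

    operations = [
        migrations.RunSQL(
            "ALTER TABLE shop_order ADD CONSTRAINT shop_order_country_notnull "
            "CHECK (country IS NOT NULL) NOT VALID;",
            reverse_sql="ALTER TABLE shop_order DROP CONSTRAINT shop_order_country_notnull;",
        ),
        migrations.RunSQL(
            # Scans the table, but only under SHARE UPDATE EXCLUSIVE.
            "ALTER TABLE shop_order VALIDATE CONSTRAINT shop_order_country_notnull;",
            reverse_sql=migrations.RunSQL.noop,
        ),
        migrations.RunSQL(
            "ALTER TABLE shop_order ALTER COLUMN country SET NOT NULL; "
            "ALTER TABLE shop_order DROP CONSTRAINT shop_order_country_notnull;",
            reverse_sql="ALTER TABLE shop_order ALTER COLUMN country DROP NOT NULL;",
        ),
    ]

Pair it with a SeparateDatabaseAndState (or a plain AlterField) so Django's migration state also records null=False; otherwise the next makemigrations will try to generate the change again.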

The rename dance — never RENAME directly

RENAME COLUMN is metadata-only and fast. The problem isn't PostgreSQL — it's app code: during a rolling deploy (or a rollback) you have running web/worker processes that still refer to the old name and start failing with "column does not exist". The pattern (a dual-write sketch follows the list):

  1. Add new column email_address (nullable).
  2. App writes to BOTH email and email_address.
  3. Backfill email_address from email.
  4. Switch reads to email_address.
  5. Stop writing to email.
  6. Drop email.

Six steps. Two are deploys; the others are background jobs. If anyone tells you they can rename a hot column in one deploy on a 100M-row table, ask to see the post-mortem.
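
A minimal sketch of step 2's dual-write, assuming the column lives on a User model. While reads still come from email, every save mirrors the value into the new column:

from django.db import models

class User(models.Model):
    email = models.EmailField()
    # Step 1's new column: nullable until the backfill completes.
    email_address = models.EmailField(null=True, blank=True)

    def save(self, *args, **kwargs):
        # Step 2: callers still read and set `email`; mirror every write
        # into the new column so it stays current while the backfill runs.
        self.email_address = self.email
        super().save(*args, **kwargs)

Note that QuerySet.update() and bulk_create() bypass save(), so bulk writes need the same mirroring by hand (or a database trigger) until step 5.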

Indexes — always CONCURRENTLY

from django.db import migrations

class Migration(migrations.Migration):
    atomic = False  # CRITICAL — CONCURRENTLY can't run inside a transaction

    operations = [
        migrations.RunSQL(
            "CREATE INDEX CONCURRENTLY orders_email_idx ON orders (email);",
            reverse_sql="DROP INDEX CONCURRENTLY orders_email_idx;",
        ),
    ]

CONCURRENTLY doesn't block writes — but it can fail half-built (e.g., on a deadlock, or a duplicate key for a unique index). If it does, the index is left marked INVALID in \d; drop it and retry.
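
Since Django 3.0 the same migration can be written without raw SQL, via AddIndexConcurrently, which also keeps Django's index state in sync (the dependency name is illustrative):

from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models

class Migration(migrations.Migration):
    atomic = False  # still required

    dependencies = [("shop", "0007_previous")]  # illustrative

    operations = [
        AddIndexConcurrently(
            "Order",
            models.Index(fields=["email"], name="orders_email_idx"),
        ),
    ]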

RunPython — the silent killer

RunPython in a migration runs during deploy, inside the migration's transaction. If your data step touches 10 million rows, the deploy blocks for an hour and the worker pool fills up. Rule: RunPython is for tiny, bounded fixups (a few hundred rows, max). Anything larger goes to a management command run separately, with batching and sleeps; a minimal shape for one follows.
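
Assuming the Order backfill from earlier and an app named shop (so the file lives at shop/management/commands/backfill_country.py):

import time

from django.core.management.base import BaseCommand

from shop.models import Order  # illustrative


class Command(BaseCommand):
    help = "Backfill Order.country in small batches. Safe to re-run."

    def handle(self, *args, **options):
        batch = 5000
        while True:
            ids = list(
                Order.objects.filter(country__isnull=True)
                .values_list("id", flat=True)[:batch]
            )
            if not ids:
                break
            Order.objects.filter(id__in=ids).update(country="NL")
            self.stdout.write(f"backfilled {len(ids)} rows")
            time.sleep(0.1)  # let replicas catch up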

The multi-deploy mental model

Frame every schema change as: "What does this look like at any moment during the deploy, when half the workers are on old code and half on new?" If the answer involves "old code panics because country is NOT NULL but it doesn't know that field exists" — your migration is unsafe.

Test it

  • django-test-migrations: assert that migrations apply both forward and backward on realistic data.
  • squawk (linter): catches dangerous DDL in CI before it hits prod.
  • Run migrations against a recent prod backup in a staging env. Time them. A 10-minute migration in staging = a 10-minute outage in prod.
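
For the first bullet, a sketch using django-test-migrations' pytest plugin (it provides the migrator fixture); the app and migration names are illustrative:

import pytest

@pytest.mark.django_db
def test_nullable_country_applies_over_existing_rows(migrator):
    # State before the change, with a realistic pre-existing row.
    old_state = migrator.apply_initial_migration(("shop", "0004_previous"))
    Order = old_state.apps.get_model("shop", "Order")
    Order.objects.create()

    # Forward: deploy 1's nullable add must not choke on existing data.
    new_state = migrator.apply_tested_migration(("shop", "0005_order_country"))
    Order = new_state.apps.get_model("shop", "Order")
    assert Order.objects.filter(country__isnull=True).count() == 1

    # Backward: make sure the rollback path works too.
    migrator.reset()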

Summary

Zero-downtime is a discipline, not a tool. Add columns nullable, backfill in chunks, gate on NOT VALID constraints, never rename in one deploy, always create indexes CONCURRENTLY, and keep RunPython for tiny things. Read every migration with the question: "What if this took 10 minutes?" If the answer is "outage," redesign the migration.