Database Advanced

Zero-Downtime Migrations for Django and PostgreSQL: Locks, Backfills, and the Rename Dance

How to ship schema changes without locking your production database. NOT NULL adds, column renames, index creation, and the multi-deploy patterns that separate Django teams who break prod from those who don't.

DjangoZen Team Apr 25, 2026 19 min read 147 views

A database migration that takes a lock on a large table can freeze your entire application for the duration — seconds or minutes during which every query waits and users see errors. As your tables grow and you deploy continuously, the naive "just run migrate" approach becomes a recipe for outages. Zero-downtime migrations are the discipline of evolving your schema without ever blocking traffic, and this tutorial covers the locking knowledge, the multi-step patterns, and the backfill techniques that make it possible.

Why migrations cause downtime

The core problem is locks. To change a table's structure, PostgreSQL must take a lock, and some operations take an exclusive lock that blocks all reads and writes to that table while they run. On a tiny table this is instant and harmless; on a table with millions of rows, an operation that rewrites the table or scans it holds that lock long enough to stall your application completely. Meanwhile, queries pile up behind the lock, connections exhaust, and the site effectively goes down. Understanding that downtime comes from locks held too long, not from migrations per se, is the key insight: the goal is to change the schema using only operations that lock briefly or not at all.

Understanding PostgreSQL lock levels

PostgreSQL has a hierarchy of lock strengths, and the difference between a safe migration and an outage is which lock your operation takes. An ACCESS EXCLUSIVE lock blocks everything — reads and writes — and is what dangerous operations grab. Weaker locks allow concurrent reads or even writes. The art of zero-downtime migration is knowing which DDL operations take which locks, and on modern PostgreSQL many operations have been made far cheaper than they used to be. But the danger is not only the lock's strength — it is also how long it is held, because even a brief exclusive lock can cause problems if it must wait behind a long-running query first. Knowing your lock levels turns migration from guesswork into engineering.

Operations that are safe

Some schema changes are inherently cheap because they take only a brief lock and do not rewrite data. Adding a new nullable column is fast on modern PostgreSQL — it is a metadata change, not a table rewrite. Creating an index concurrently avoids locking writes. Adding a table, or adding a constraint as NOT VALID and validating it separately, are safe. These operations form the building blocks of zero-downtime changes: when you can express a change using only cheap operations, you can deploy it without a second thought. Much of the discipline is learning to decompose a risky change into a sequence of these individually-safe steps.

Operations that are dangerous

Other operations are traps on a large table. Adding a column with a non-constant default historically rewrote the entire table under an exclusive lock. Adding a NOT NULL constraint directly scans every row while locking. Changing a column's type often rewrites the table. Adding a foreign key validates every existing row under a lock. Creating an index non-concurrently blocks writes for the whole build. Each of these can freeze a big table for an unacceptable time. The skill is recognizing these dangerous operations and knowing the safe, multi-step alternative for each — because the naive single migration that does them directly is exactly what takes your site down.

Building indexes concurrently

Indexes are essential but building one normally locks the table against writes for the entire build, which on a large table means a long outage. PostgreSQL's CREATE INDEX CONCURRENTLY builds the index without blocking writes, at the cost of taking longer and not running inside a transaction. Django supports this with AddIndexConcurrently and atomic-migration handling. The concurrent build is the standard way to add an index to a busy production table safely. It is a perfect microcosm of the whole topic: the same logical change (add an index) has a dangerous form and a safe form, and choosing the safe form is what keeps traffic flowing while the change happens.

The expand-contract pattern

The master technique for risky changes is expand-contract (also called parallel change), which splits a change across multiple deploys so the database and code are always compatible. You expand by adding the new schema alongside the old, deploy code that writes to both and reads as appropriate, backfill the old data into the new shape, then contract by removing the old schema once nothing uses it. At no point is the running code incompatible with the database, so there is never a moment requiring downtime. This pattern underlies nearly every safe version of an otherwise-dangerous change, and internalizing it is the heart of zero-downtime migration thinking.

The rename dance

Renaming a column looks trivial but is a classic downtime trap, because the moment you rename it, the old code still running (during a rolling deploy) references a column that no longer exists, and breaks. The safe "rename dance" is pure expand-contract: add the new column, deploy code that writes to both old and new, backfill existing rows, deploy code that reads from the new column, and finally drop the old column in a later deploy. It takes several deploys to accomplish what looks like a one-line change, but each step keeps old and new code working against the database. The rename dance is the canonical example of why zero-downtime migrations require patience and multiple releases.

Adding NOT NULL safely

Adding a NOT NULL constraint directly forces a full-table scan under a lock to verify every row. The safe approach decomposes it: add the column nullable, backfill it so no nulls remain, add a CHECK (column IS NOT NULL) NOT VALID constraint (which is cheap because it does not scan), then VALIDATE the constraint in a separate step (which scans but takes only a weak lock allowing concurrent writes). This sequence achieves the same end state — a column that cannot be null — without ever holding a blocking lock long enough to matter. It is a textbook example of trading one dangerous operation for several cheap ones.

Backfilling large tables

Populating a new column or migrating data across millions of rows must not happen in a single transaction, because one giant UPDATE locks every row it touches and holds a transaction open for a long time, bloating the table and risking timeouts. Instead, backfill in batches: update a few thousand rows at a time, in separate transactions, with a brief pause between batches to let other queries through. This keeps each lock short and the database responsive throughout. Backfills are often run as a separate management command or task rather than inside the migration itself, precisely so they can be batched and monitored. Patient, batched backfilling is how you move large amounts of data without anyone noticing.

Changing a column type

Altering a column's type frequently rewrites the entire table under a lock, making it one of the more dangerous operations. The zero-downtime approach is again expand-contract: add a new column of the desired type, write to both columns from application code, backfill the new column in batches, switch reads to the new column, and drop the old one later. For some type changes there are shortcuts, but the general, always-safe method is to treat it as adding a new column and migrating to it gradually. The instinct to "just change the type" in one migration is exactly what locks a large table for minutes, so the multi-step path is the price of safety.

Adding foreign keys without locking

Adding a foreign key constraint normally validates every existing row against the referenced table while holding a lock. PostgreSQL lets you split this: add the constraint as NOT VALID, which is cheap and immediately enforces the constraint for new and modified rows, then run VALIDATE CONSTRAINT separately, which checks existing rows under a weaker lock that permits concurrent activity. The end result is a fully valid foreign key, achieved without a long blocking validation. This two-step approach mirrors the others in this tutorial — separate the cheap structural change from the expensive verification, and run the expensive part in a way that does not block traffic.

Lock timeouts as a safety net

Even a "safe" operation can cause trouble if it must wait behind a long-running query to acquire its lock — while it waits, it queues, and everything behind it queues too, cascading into a stall. Setting a lock_timeout before a migration is a crucial safety net: if the operation cannot get its lock quickly, it fails fast instead of waiting and blocking the world. You then retry when the table is quieter. This prevents the scenario where a quick DDL operation, blocked behind a slow report query, brings down the site by holding everything else hostage in the lock queue. A short lock timeout turns a potential outage into a harmless, retryable failure.

Tooling and testing

You do not have to carry all this lock knowledge in your head. Tools can analyze Django migrations and warn when an operation is unsafe on a large table, catching dangerous migrations in code review before they reach production. Beyond tooling, test migrations against a realistic copy of production data, not an empty development database — a migration that is instant on ten rows can lock for minutes on ten million. Reviewing migrations specifically for locking behavior, and rehearsing them at production scale, is what catches the dangerous ones before they cause an incident. Make migration safety an explicit part of your review and testing process, not an afterthought discovered in production.

Coordinating code and schema across deploys

The deeper principle behind zero-downtime migrations is that during a rolling deploy, old and new code run simultaneously against one database, so the schema must be compatible with both at every moment. This forces a particular choreography: schema changes that only add are deployed before the code that uses them, and schema changes that remove come after the code that stopped using them. A single logical change — rename a field, drop a column — becomes a sequence of deploys, each keeping every running version of your code working. Internalizing that the database is shared by multiple code versions during every deploy is what makes the multi-step patterns feel necessary rather than pedantic.

Reversible migrations and rollback

Deploys sometimes go wrong, and you may need to roll the code back — which means the database state must remain compatible with the previous version too. This is another reason destructive changes are deferred: if you drop a column and then must roll back to code that expects it, you are stuck. Keeping migrations reversible where possible, and sequencing destructive operations to come only after a change has proven stable, preserves your ability to retreat safely. Designing migrations with rollback in mind — not just forward success — is part of treating schema change as a careful, reversible process rather than a one-way leap you cannot undo if the deploy misbehaves.

Data migrations versus schema migrations

Migrations come in two flavors with different risks. Schema migrations change structure; data migrations transform the contents of rows. Data migrations are easy to get wrong at scale — a migration that rewrites every row in one transaction locks the table and can run for a very long time, or time out entirely. Keep data migrations separate from schema changes, run large data transformations in batches outside the migration where they can be monitored and paused, and test them against realistic data volumes. The instinct to bundle a data backfill into a schema migration is exactly what turns a quick structural change into a long-running, table-locking operation.

Verifying migration safety automatically

Lock knowledge is hard to keep in everyone's head, so automate the checking. Tools can analyze Django migrations and flag operations that are unsafe on a large table — a direct NOT NULL addition, a non-concurrent index, a risky type change — catching them in code review before they ever run in production. Combined with testing migrations against a production-scale dataset rather than an empty dev database, this makes migration safety a systematic part of your process rather than relying on individual vigilance. Making the dangerous patterns visible automatically is what scales zero-downtime discipline across a team, so a junior developer's risky migration is caught by tooling, not by an outage.

Summary

Zero-downtime migrations rest on one insight: downtime comes from locks held too long, so the goal is to change your schema using only operations that lock briefly or not at all. Learn which operations are cheap (adding nullable columns, concurrent indexes, NOT VALID constraints) and which are dangerous (direct NOT NULL, type changes, non-concurrent indexes, validated foreign keys on large tables). Decompose every risky change with the expand-contract pattern — add the new alongside the old, migrate gradually, remove the old later — which keeps database and code compatible at every step, as the rename dance and safe NOT NULL addition illustrate. Backfill large data changes in batches across many transactions, set a lock_timeout so blocked operations fail fast instead of cascading, and test migrations against production-scale data with tooling that flags unsafe ones. It takes more deploys and more patience than a naive migration, but it is the difference between evolving your schema invisibly and freezing your site every time a table grows.