Zero-Downtime PostgreSQL Migrations: A Battle-Tested Playbook
It’s 2 AM and your migration just locked a table with 50 million rows. Queries are timing out, the on-call Slack channel is exploding, and you’re frantically trying to remember if you can safely kill a running ALTER TABLE. Every senior engineer has a story like this—here’s how to make sure it’s not your story.
This playbook distills years of hard-won lessons from running PostgreSQL migrations on high-traffic production systems. You’ll learn the patterns that separate routine deployments from multi-hour incidents, and walk away with concrete techniques you can apply to your next schema change.
Why Production Migrations Are Different
Your migration script runs flawlessly in development. It passes CI. It works perfectly in staging. Then it brings down production for 45 minutes. This pattern repeats across organizations because engineers underestimate a fundamental difference: production has concurrent load.
In development, you’re the only connection to the database. When you run ALTER TABLE users ADD COLUMN last_login TIMESTAMP, PostgreSQL acquires an ACCESS EXCLUSIVE lock, modifies the table metadata, and releases the lock in milliseconds. No other queries are waiting, so you never notice the lock. Your tests pass because they run in isolation, typically against a database with hundreds of rows rather than millions.
Production tells a different story. That same ALTER TABLE statement needs the same ACCESS EXCLUSIVE lock, but now hundreds of queries are actively reading from and writing to the users table. PostgreSQL’s lock manager follows a strict queuing discipline: your DDL statement enters the lock queue behind active transactions, and crucially, all new queries queue behind your DDL statement. What should have been a sub-second operation becomes a cascading failure.
Lock contention is the silent killer of production deployments. The migration itself isn’t slow—acquiring the lock is what takes forever. A long-running SELECT statement that started before your migration will hold a conflicting lock, and your ALTER TABLE will wait. Every query that arrives after your migration attempt will also wait, forming an ever-growing queue. Connection pools exhaust. Timeouts trigger. The application becomes unresponsive. What makes this particularly insidious is that the problem compounds over time—the longer the lock waits, the more queries pile up behind it, and the worse the eventual impact becomes.
Understanding the distinction between blocking and non-blocking DDL operations is essential for production safety. Most DDL statements in PostgreSQL require ACCESS EXCLUSIVE locks, which conflict with every other lock type including simple SELECT statements. This means even read-only queries get blocked, not just writes. However, PostgreSQL provides several non-blocking alternatives that every production engineer should know: CREATE INDEX CONCURRENTLY builds indexes without blocking reads or writes, DROP INDEX CONCURRENTLY removes indexes safely, and certain ALTER TABLE variants complete without full table locks. For example, adding a nullable column without a default value is nearly instantaneous regardless of table size because PostgreSQL only updates the system catalog, not the actual table data.
The key insight is that migration safety in production isn’t about the complexity of your schema change—it’s about the lock duration multiplied by your query volume. A simple column addition on a table serving 10,000 queries per second is far more dangerous than a complex stored procedure change on a rarely-accessed audit log table. This multiplicative relationship means that even brief lock acquisitions can cause significant user-facing impact at scale. Planning migrations requires understanding both your schema changes and your traffic patterns.
The Expand-Contract Pattern
The expand-contract pattern transforms dangerous, all-or-nothing migrations into a series of safe, reversible steps. Instead of modifying schema in place, you expand the schema to support both old and new structures, migrate data and application code, then contract by removing the old structure. This approach has its roots in continuous delivery practices and has become the gold standard for zero-downtime database changes.
The fundamental principle is simple: never put your database and application in a state where they disagree about the schema. Traditional migrations create a moment where the database has the new schema but the application still expects the old one (or vice versa). The expand-contract pattern eliminates this dangerous window by ensuring both old and new schemas coexist during the transition.
Consider a common scenario: renaming a column from user_name to username. The naive approach runs a single migration:
```sql
-- DON'T DO THIS IN PRODUCTION
ALTER TABLE users RENAME COLUMN user_name TO username;
```

This statement acquires an ACCESS EXCLUSIVE lock and immediately breaks every query referencing user_name. Any application server that hasn’t been updated will throw errors, and even updated servers might fail if they have cached prepared statements. The expand-contract approach breaks this into four deployments:
Step 1: Expand the schema
```sql
-- Safe: adding a nullable column is nearly instantaneous
ALTER TABLE users ADD COLUMN username VARCHAR(255);
```
```sql
-- Create a trigger to keep both columns synchronized
CREATE OR REPLACE FUNCTION sync_username()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
        IF NEW.username IS NULL AND NEW.user_name IS NOT NULL THEN
            NEW.username := NEW.user_name;
        ELSIF NEW.user_name IS NULL AND NEW.username IS NOT NULL THEN
            NEW.user_name := NEW.username;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_username
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW
    EXECUTE FUNCTION sync_username();
```

The trigger ensures that writes to either column automatically populate the other, maintaining consistency regardless of which version of the application code is running. This bidirectional synchronization is crucial during rolling deployments where multiple application versions coexist.
Step 2: Backfill existing data
```sql
-- Run this as a batched operation (see the backfilling section below)
UPDATE users SET username = user_name WHERE username IS NULL;
```

Step 3: Deploy application changes
Update your application code to read from and write to username. The trigger ensures that writes to either column keep both in sync. Deploy this change gradually—canary deployments or blue-green deployments work well here. Monitor error rates and performance metrics carefully before proceeding to full rollout.
Step 4: Contract the schema
```sql
-- Only after confirming the application works with the new column
DROP TRIGGER users_sync_username ON users;
DROP FUNCTION sync_username();
ALTER TABLE users DROP COLUMN user_name;
```

This pattern maintains backward compatibility throughout the transition. If something goes wrong after step 1 or 2, you can roll back the application without any database changes. The synchronization trigger ensures data consistency during the transition window when both old and new application versions might be running. Each step is independently deployable and reversible, giving you multiple safe stopping points.
💡 Pro Tip: The expand phase should always be deployable independently of application changes. This gives you a natural rollback point—if the new column causes unexpected issues, simply drop it without touching application code. Consider waiting 24-48 hours between the expand and contract phases to ensure everything works correctly under real traffic patterns.
The expand-contract pattern applies to almost any schema change: adding columns, changing column types, splitting tables, or reorganizing relationships. The specific implementation varies, but the principle remains constant: never make a change that requires simultaneous database and application updates. This decoupling is what enables true zero-downtime deployments.
PostgreSQL Lock Management Deep Dive
PostgreSQL’s locking system provides correctness guarantees, but those guarantees come with operational costs you need to understand. Different DDL operations require different lock types, and knowing which locks conflict helps you predict migration behavior.
PostgreSQL implements a multi-level locking system with eight distinct lock modes arranged in a hierarchy of exclusivity. At the bottom is ACCESS SHARE, acquired by SELECT statements, which allows maximum concurrency. At the top is ACCESS EXCLUSIVE, required by most ALTER TABLE operations, which blocks everything. Between these extremes are locks like ROW EXCLUSIVE (used by UPDATE and DELETE), SHARE (used by CREATE INDEX without the CONCURRENTLY option), and SHARE UPDATE EXCLUSIVE (used by VACUUM and CREATE INDEX CONCURRENTLY).
The ACCESS EXCLUSIVE lock is the most restrictive—it conflicts with all other lock types, including ACCESS SHARE acquired by simple SELECT statements. Most ALTER TABLE operations require this lock: adding columns with defaults, adding constraints, changing column types, and renaming columns all need exclusive access. The SHARE UPDATE EXCLUSIVE lock is less restrictive and is used by CREATE INDEX CONCURRENTLY and VACUUM—it blocks only writes and other DDL operations, not reads.
Understanding lock queue behavior is critical. When a transaction requests a lock it cannot immediately acquire, it enters a wait queue. PostgreSQL processes this queue fairly, meaning later requests cannot jump ahead of earlier ones. This fairness guarantee, while preventing starvation, creates the cascading problem described earlier: your DDL statement waiting for a lock blocks all subsequent queries that need conflicting locks, even if those queries would be compatible with the currently held lock.
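You can reproduce this queue behavior yourself with three psql sessions against a scratch copy of the schema (the table and column here are just for illustration):

```sql
-- Session 1: an open transaction holds ACCESS SHARE on users
BEGIN;
SELECT count(*) FROM users;
-- (transaction intentionally left open)

-- Session 2: the DDL queues behind session 1, waiting for ACCESS EXCLUSIVE
ALTER TABLE users ADD COLUMN last_login TIMESTAMP;

-- Session 3: even a plain read now queues behind the waiting DDL
SELECT * FROM users WHERE id = 1;
```

Session 3 is blocked even though its ACCESS SHARE request is compatible with session 1's lock; it queues behind the waiting ACCESS EXCLUSIVE request, and everything unblocks only after session 1 ends its transaction and the ALTER TABLE completes.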
Your most important tool is lock_timeout. Setting this parameter before any DDL operation ensures your migration fails fast rather than causing cascading failures:
```sql
-- Always set lock_timeout before DDL in production
SET lock_timeout = '3s';

-- Now attempt the migration
ALTER TABLE users ADD COLUMN preferences JSONB;

-- Reset for normal operations
RESET lock_timeout;
```

If the lock cannot be acquired within 3 seconds, PostgreSQL aborts the statement with an error. This is dramatically better than waiting indefinitely while blocking all other queries. You can retry the migration during a lower-traffic period. The specific timeout value depends on your traffic patterns—3 seconds is a reasonable starting point for high-traffic tables, but you might use longer timeouts for low-traffic tables or shorter timeouts for extremely hot tables.
⚠️ Warning: A failed migration is always better than a cascading outage. Set lock_timeout before every DDL operation in production. Period. This single practice will prevent more incidents than any other technique in this guide.
For index creation on large tables, CREATE INDEX CONCURRENTLY is essential:
```sql
-- This blocks nothing - runs alongside production traffic
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

-- Regular CREATE INDEX would block all writes to the table
-- DON'T DO THIS on production tables:
-- CREATE INDEX idx_users_email ON users(email);
```

The concurrent variant takes longer to complete because it performs multiple table scans and waits for existing transactions, but it holds only a SHARE UPDATE EXCLUSIVE lock that doesn’t block normal reads and writes. The trade-off is worthwhile: a 10-minute concurrent index creation is far better than a 30-second blocking creation that causes 30 seconds of complete application unavailability. Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, which requires special handling in migration frameworks.
When adding constraints, you can similarly split the operation into two phases:
```sql
-- Step 1: Add constraint as NOT VALID (instant, no data scan)
SET lock_timeout = '3s';
ALTER TABLE orders
    ADD CONSTRAINT orders_user_fk
    FOREIGN KEY (user_id) REFERENCES users(id)
    NOT VALID;

-- Step 2: Validate separately (scans data but doesn't block writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_fk;
```

Adding the constraint with NOT VALID only validates new rows going forward—it doesn’t scan existing data, making it nearly instantaneous. The separate VALIDATE CONSTRAINT step scans the table but holds a weaker SHARE UPDATE EXCLUSIVE lock that permits concurrent inserts, updates, and deletes. If the validation finds invalid data, it will fail, but your application continues running normally throughout the process.
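The same two-phase idea extends to enforcing NOT NULL on an existing column. A sketch, assuming PostgreSQL 12 or later and an email column you want to tighten:

```sql
-- Step 1: Add a CHECK constraint without scanning existing rows
SET lock_timeout = '3s';
ALTER TABLE users
    ADD CONSTRAINT users_email_not_null
    CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: Validate with the weaker lock that permits concurrent writes
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;

-- Step 3: On PostgreSQL 12+, SET NOT NULL can use the validated constraint
-- as proof and skip the full table scan, so its exclusive lock is brief
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
```

The CHECK constraint becomes redundant once the column is marked NOT NULL, which is why the sketch drops it at the end.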
For monitoring lock contention in real-time, query pg_stat_activity and pg_locks:
```sql
-- Find queries blocking or blocked by your migration
SELECT
    blocked.pid    AS blocked_pid,
    blocked.query  AS blocked_query,
    blocking.pid   AS blocking_pid,
    blocking.query AS blocking_query,
    blocking.state AS blocking_state
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked
    ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.relation = blocked_locks.relation
   AND blocking_locks.pid != blocked_locks.pid
   AND blocking_locks.granted  -- only sessions that actually hold the lock
JOIN pg_stat_activity blocking
    ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```

This query shows you exactly which queries are blocking your migration, allowing you to make informed decisions about whether to wait, cancel the blocking query, or abort your migration attempt. Run this query in a separate session before and during your migration to understand the lock landscape.
Implementing Safe Migrations with Alembic
Alembic is the de facto standard for SQLAlchemy migrations, but its default configuration prioritizes development convenience over production safety. A production-ready Alembic setup requires explicit configuration around lock management and transaction handling. Without these adjustments, Alembic migrations can easily cause the exact problems this guide aims to prevent.
The core issue is that Alembic, by default, runs all migrations in a single transaction without any lock timeout. This means a single stuck migration can block all database operations indefinitely. Additionally, Alembic doesn’t distinguish between operations that can run inside transactions and those that cannot (like CREATE INDEX CONCURRENTLY), leading to confusing failures when you first attempt concurrent operations.
Start by configuring Alembic’s env.py to set lock_timeout before running migrations:
```python
from alembic import context
from sqlalchemy import pool, text
from sqlalchemy.engine import Connection


def run_migrations_online() -> None:
    """Run migrations in 'online' mode with production-safe defaults."""
    connectable = context.config.attributes.get("connection", None)

    if connectable is None:
        from sqlalchemy import create_engine

        connectable = create_engine(
            context.config.get_main_option("sqlalchemy.url"),
            poolclass=pool.NullPool,
        )

    with connectable.connect() as connection:
        # Set lock_timeout for all migration operations
        lock_timeout_ms = context.config.get_main_option(
            "lock_timeout_ms", "5000"
        )
        connection.execute(
            text(f"SET lock_timeout = '{lock_timeout_ms}ms'")
        )
        # Statement timeout as a safety net for runaway queries
        connection.execute(
            text("SET statement_timeout = '300s'")
        )

        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            compare_type=True,
            transaction_per_migration=True,  # Each migration in its own transaction
        )

        with context.begin_transaction():
            context.run_migrations()
```

The transaction_per_migration=True setting ensures each migration file runs in its own transaction. This prevents a failure in one migration from rolling back previous successful migrations, making your deployment state more predictable. Without this setting, a failure in migration 5 of 10 would roll back migrations 1-4 as well, even though they completed successfully.
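The lock_timeout_ms value read above is not a built-in Alembic setting; this setup assumes you define it yourself as a custom option in the [alembic] section of alembic.ini, along the lines of:

```ini
[alembic]
script_location = migrations
sqlalchemy.url = postgresql://user:pass@localhost/myapp

# Custom option consumed by env.py above (milliseconds)
lock_timeout_ms = 5000
```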
For migrations involving concurrent index creation, you need to run outside a transaction:
"""Add email index concurrently.
Revision ID: abc123Revises: def456Create Date: 2026-02-09 10:00:00.000000"""from alembic import op
# This migration cannot run inside a transaction# because CREATE INDEX CONCURRENTLY requires itrevision = 'abc123'down_revision = 'def456'branch_labels = Nonedepends_on = None
def upgrade() -> None: # Disable automatic transaction for this migration op.execute("COMMIT")
# Now we can run concurrent operations op.create_index( 'idx_users_email', 'users', ['email'], unique=False, postgresql_concurrently=True, )
def downgrade() -> None: op.execute("COMMIT") op.drop_index( 'idx_users_email', table_name='users', postgresql_concurrently=True, )📝 Note: When using
postgresql_concurrently=True, the migration cannot be transactional. If the index creation fails partway through, you’ll have an invalid index that must be manually dropped withDROP INDEX CONCURRENTLY. Always check for invalid indexes after running concurrent migrations by queryingpg_indexfor entries whereindisvalid = false.
For handling lock acquisition failures gracefully, wrap your operations with retry logic:
"""Add preferences column with retry logic.
Revision ID: xyz789"""from alembic import opimport sqlalchemy as safrom sqlalchemy import textimport time
revision = 'xyz789'down_revision = 'abc123'
def upgrade() -> None: max_attempts = 5 base_delay = 2.0
for attempt in range(max_attempts): try: conn = op.get_bind() conn.execute(text("SET lock_timeout = '3s'"))
op.add_column( 'users', sa.Column('preferences', sa.JSON(), nullable=True) ) return # Success, exit the retry loop
except sa.exc.OperationalError as e: if "lock timeout" in str(e).lower() and attempt < max_attempts - 1: delay = base_delay * (2 ** attempt) # Exponential backoff print(f"Lock timeout, retrying in {delay}s (attempt {attempt + 1})") time.sleep(delay) else: raise
def downgrade() -> None: op.drop_column('users', 'preferences')This pattern implements exponential backoff for lock timeouts, automatically retrying during transient lock contention while still failing definitively if the lock cannot be acquired after multiple attempts. The exponential backoff (2s, 4s, 8s, 16s) gives progressively more time for blocking transactions to complete while still failing within a reasonable timeframe.
Consider also creating a base migration class or utility functions that encapsulate these safety patterns, making it easy for team members to write safe migrations without remembering all the details each time.
Backfilling Data Without Bringing Down the Database
When you add a new column that requires populating data from existing columns, a single UPDATE statement is a production hazard. Updating 50 million rows in one transaction holds row locks for the entire duration, generates massive WAL traffic, and competes with production queries for I/O bandwidth. The transaction log can grow to enormous sizes, replication lag spikes, and if anything goes wrong, the entire operation rolls back, wasting all that work.
The solution is batched updates with throttling. By breaking the update into small chunks and committing each chunk independently, you release locks quickly, allow replication to keep up, and provide natural checkpoints for resumability.
"""Backfill script for migrating user_name to username column.Run this after the expand migration, before the contract migration."""import timeimport psycopg2from psycopg2.extras import execute_valuesfrom contextlib import contextmanager
# ConfigurationDATABASE_URL = "postgresql://user:pass@localhost/myapp"BATCH_SIZE = 1000SLEEP_BETWEEN_BATCHES = 0.1 # 100ms pause between batchesPROGRESS_INTERVAL = 10000
@contextmanagerdef get_connection(): conn = psycopg2.connect(DATABASE_URL) try: yield conn finally: conn.close()
def get_total_count(conn) -> int: """Get count of rows needing backfill.""" with conn.cursor() as cur: cur.execute( "SELECT COUNT(*) FROM users WHERE username IS NULL AND user_name IS NOT NULL" ) return cur.fetchone()[0]
def backfill_batch(conn, last_id: int) -> tuple[int, bool]: """ Process one batch of rows. Returns (new_last_id, has_more_rows). """ with conn.cursor() as cur: # Fetch batch of IDs to update cur.execute( """ SELECT id FROM users WHERE id > %s AND username IS NULL AND user_name IS NOT NULL ORDER BY id LIMIT %s """, (last_id, BATCH_SIZE) ) rows = cur.fetchall()
if not rows: return last_id, False
ids = [row[0] for row in rows] max_id = ids[-1]
# Update the batch cur.execute( """ UPDATE users SET username = user_name WHERE id = ANY(%s) """, (ids,) )
conn.commit() # Commit each batch separately return max_id, True
def run_backfill(): """Execute the backfill with progress reporting.""" with get_connection() as conn: total = get_total_count(conn) print(f"Starting backfill of {total:,} rows")
processed = 0 last_id = 0 start_time = time.time()
while True: last_id, has_more = backfill_batch(conn, last_id)
if not has_more: break
processed += BATCH_SIZE
if processed % PROGRESS_INTERVAL == 0: elapsed = time.time() - start_time rate = processed / elapsed remaining = (total - processed) / rate if rate > 0 else 0 print( f"Processed {processed:,}/{total:,} rows " f"({processed/total*100:.1f}%) - " f"ETA: {remaining/60:.1f} minutes" )
# Throttle to leave headroom for production traffic time.sleep(SLEEP_BETWEEN_BATCHES)
elapsed = time.time() - start_time print(f"Backfill complete: {processed:,} rows in {elapsed:.1f}s")
if __name__ == "__main__": run_backfill()Several design decisions make this script production-safe:
Small transactions: Each batch commits independently, releasing locks quickly and allowing replication to keep up. If the script crashes or is stopped, you lose at most one batch of work. This also keeps the transaction log manageable—instead of one giant transaction, you have thousands of small ones.
Ordered processing: Using ORDER BY id with a cursor (WHERE id > last_id) ensures predictable progress and resumability. If you need to restart, you can adjust last_id to continue from where you stopped. This approach also provides consistent performance because you’re always working with a known, indexed access pattern.
Sleep intervals: The SLEEP_BETWEEN_BATCHES parameter leaves CPU and I/O headroom for production queries. Monitor your database load and adjust this value accordingly. During peak traffic, you might increase the sleep interval or pause the backfill entirely.
Progress reporting: For long-running backfills, visibility into progress prevents unnecessary anxiety and helps estimate completion time. Knowing you’re 60% done with 2 hours remaining is much better than wondering if the script is even working.
💡 Pro Tip: Run backfills during off-peak hours and monitor replication lag. If replicas start falling behind, increase SLEEP_BETWEEN_BATCHES or pause the backfill entirely. Consider setting up alerts on replication lag specifically during backfill operations.
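A quick way to check that lag, assuming streaming replication on PostgreSQL 10 or later:

```sql
-- On the primary: per-replica lag reported by streaming replication
SELECT application_name, client_addr, state, replay_lag
FROM pg_stat_replication;

-- On a replica: how far behind replay is right now
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```

Either number trending steadily upward during a backfill is your cue to slow down.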
For extremely large tables, consider running the backfill in parallel across ID ranges:
```python
from concurrent.futures import ThreadPoolExecutor


def backfill_range(start_id: int, end_id: int, thread_id: int):
    """Backfill a specific ID range."""
    # Similar to backfill_batch but constrained to [start_id, end_id]
    pass


def run_parallel_backfill(num_workers: int = 4):
    """Split the table into ranges and process in parallel."""
    with get_connection() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT MIN(id), MAX(id) FROM users")
            min_id, max_id = cur.fetchone()

    range_size = (max_id - min_id) // num_workers
    ranges = [
        (min_id + i * range_size, min_id + (i + 1) * range_size)
        for i in range(num_workers)
    ]
    ranges[-1] = (ranges[-1][0], max_id)  # Include remainder in last range

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(backfill_range, start, end, i)
            for i, (start, end) in enumerate(ranges)
        ]
        for future in futures:
            future.result()  # Wait for completion and propagate exceptions
```

Be cautious with parallelism—more workers means more database connections and more I/O contention. Test with different worker counts to find the sweet spot for your specific hardware and workload.
Building Your Migration Safety Checklist
Before running any migration in production, systematically evaluate the risk using a standardized checklist. This process surfaces problems before they become incidents. Having a written checklist ensures consistency across team members and prevents the “I forgot to check” scenarios that lead to outages.
Pre-flight checks:
- Table size and row count: Query pg_total_relation_size() and COUNT(*) for affected tables. A 100MB table behaves very differently from a 100GB table. Knowing the size helps you estimate lock acquisition time and backfill duration (see the query sketch after this list).
- Active connections and query patterns: Check pg_stat_activity for long-running queries on the target tables. A query that’s been running for 10 minutes will block your migration for its entire remaining duration. Consider whether you need to coordinate with teams running analytics queries or batch jobs.
- Replication lag: If you’re running replicas, check their lag before starting. Migrations generate WAL traffic that replicas must process. Starting a migration with existing lag compounds the problem and can cause replicas to fall unacceptably far behind.
- Lock analysis: Determine what locks your migration requires. Will a simple SELECT block the migration? Will the migration block SELECT statements? Use the PostgreSQL documentation or test in staging to understand the exact lock behavior.
- Rollback plan: Know exactly how you’ll reverse the change if needed. For expand-contract migrations, this is straightforward. For other changes, script the rollback in advance and test it. A rollback plan that hasn’t been tested isn’t really a plan.
- Disk space: Ensure sufficient disk space for WAL generation, especially for large backfills. Running out of disk space during a migration is catastrophic.
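A couple of these checks can be scripted directly; a minimal SQL sketch, using users as a stand-in for your target table:

```sql
-- Table size and approximate row count for the target table
SELECT pg_size_pretty(pg_total_relation_size('users')) AS total_size,
       n_live_tup AS approx_rows
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Sessions busy long enough to block an ACCESS EXCLUSIVE lock request
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;
```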
Timing considerations:
Schedule high-risk migrations during low-traffic windows. Analyze your traffic patterns to identify the quietest periods—often early morning on weekdays or weekend nights, depending on your user base. The reduced query volume means faster lock acquisition and smaller blast radius if something goes wrong.
Avoid running migrations during:
- Marketing campaigns or product launches
- Month-end processing for B2B applications
- Immediately before holidays when on-call coverage is thin
- Other maintenance activities that might compete for resources
Communication protocols:
For migrations classified as high-risk (large tables, new lock types, first-time patterns), implement a formal communication protocol:
- Announce the migration window in advance through your team’s communication channel
- Confirm on-call engineers are available and aware
- Have a designated rollback decision-maker who can act quickly
- Set up monitoring dashboards before starting
- Communicate completion or incident status immediately after
Risk classification matrix:
Categorize migrations by their risk profile to match process rigor to risk level:
- Low risk: Adding nullable columns to small tables, creating indexes concurrently on tables under 1GB
- Medium risk: Adding constraints, backfilling data, modifying column defaults, indexes on tables over 1GB
- High risk: Changing column types, dropping columns, any operation on tables with >10M rows or during peak traffic
Low-risk migrations can run during normal deployment windows. High-risk migrations warrant dedicated change windows with full team awareness.
When Things Go Wrong: Recovery Patterns
Despite careful planning, migrations sometimes fail or cause unexpected problems. Your response in the first few minutes determines whether you have a brief incident or an extended outage. Having practiced recovery procedures before you need them makes all the difference.
Safely canceling stuck migrations:
If a migration is blocked waiting for a lock, you can cancel it with:
```sql
SELECT pg_cancel_backend(pid);     -- Graceful cancellation
SELECT pg_terminate_backend(pid);  -- Force termination if cancel doesn't work
```

Find the migration’s PID from pg_stat_activity. Cancellation rolls back any partial work within the transaction—this is safe for transactional migrations but requires careful consideration for CREATE INDEX CONCURRENTLY, which leaves an invalid index on failure. After terminating a concurrent index creation, check for and clean up invalid indexes:
```sql
-- Find invalid indexes
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;

-- Drop invalid index if found
DROP INDEX CONCURRENTLY idx_name;
```

⚠️ Warning: Never kill a migration blindly. Check pg_stat_activity to confirm you’re terminating the correct backend, and understand what state the database will be in after termination. Terminating the wrong backend aborts someone else’s in-flight work, and resorting to an OS-level kill -9 forces the entire server through crash recovery.
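One way to find the right PID before cancelling anything; the filter on query text is just an assumption about what your migration tool runs:

```sql
-- Candidate migration backends and what they are currently waiting on
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE 'ALTER TABLE%'
   OR query ILIKE 'CREATE INDEX%';
```

A wait_event_type of Lock confirms the backend is stuck in the lock queue rather than doing work.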
Forward vs. backward recovery:
When a migration partially completes or causes application errors, you face a decision: push forward to complete the change, or roll back to the previous state.
Choose forward recovery when:
- The migration is mostly complete and finishing is faster than reverting
- Rolling back would lose significant data or work
- The issue is in application code, not the schema change itself
- You have a clear path to completion
Choose backward recovery when:
- The migration caused data corruption or loss
- Application errors are widespread and affecting users
- You don’t fully understand why the migration failed
- The risk of continuing outweighs the cost of reverting
For expand-contract migrations, backward recovery is usually straightforward: the old structure still exists and works. For destructive migrations (dropped columns, changed types), forward recovery may be your only option, which is why non-destructive patterns are so valuable.
Post-incident analysis:
After resolving the immediate issue, conduct a blameless post-mortem focused on systemic improvements:
- Why didn’t testing catch this? Were staging conditions sufficiently production-like?
- What monitoring would have detected the problem earlier?
- How can the pre-flight checklist be improved?
- Does the team need additional training on PostgreSQL lock mechanics?
- Should the migration have been classified as higher risk?
Document the incident and its resolution. These records become invaluable references for future migrations and for onboarding new team members. Each incident is a learning opportunity that makes your team better prepared for the next challenge.
Key Takeaways
- Always set lock_timeout before running DDL in production—a failed migration is better than a cascading outage
- Use the expand-contract pattern to break dangerous migrations into multiple deployments with rollback points
- Understand PostgreSQL’s lock hierarchy and choose non-blocking alternatives like CREATE INDEX CONCURRENTLY whenever possible
- Batch all data backfills and include sleep intervals to leave capacity for production traffic
- Build a pre-flight checklist that includes table size, peak traffic windows, and replication lag before any migration
- Configure Alembic with transaction_per_migration=True and implement retry logic with exponential backoff for lock timeouts
- Have tested rollback procedures ready before you start—a plan that hasn’t been tested isn’t really a plan