Database Migrations in Production: How to Not Wake Up at 3am

We've both been that developer. The one who runs a migration on a live database, watches the site go down for 45 seconds, and refreshes the error monitoring dashboard like pressing F5 faster will somehow fix it. You learn zero-downtime migrations the same way you learn not to touch a hot stove — by getting burned.

The good news: there are only a handful of patterns you actually need. The bad news: they require you to think about your schema changes differently than you probably do today. This post is everything we wish someone had told us before we started shipping migrations to real users.

Why Migrations Break Production (And It's Not the Obvious Reason)

Most developers think downtime from migrations comes from long-running ALTER TABLE statements locking rows. That's real, but it's not the most common cause. The most common cause is deploying code and schema changes at the same time, in the wrong order.

Here's the scenario: you rename a column from `user_name` to `username`. You run the migration and deploy the new code simultaneously. For 30 seconds during the deploy, you have old code instances still running that reference `user_name` — which no longer exists. Every query they touch throws. Your users see 500 errors. You see your career flash before your eyes.

The fix isn't faster deploys. It's accepting that your database schema and your application code need to be able to coexist across versions. This is called the expand/contract pattern, and once it clicks, everything else makes sense.

The Expand/Contract Pattern (Just Do This)

Instead of making breaking schema changes in one shot, you split every change into three phases. Expand, then transition, then contract. It takes longer but it's the only approach that actually works at production scale.

Expand: Add the new thing without removing the old thing. New column alongside old column. New table alongside old table.
Transition: Deploy code that writes to BOTH old and new. Backfill existing data. Keep this running long enough that no old code instances are left.
Contract: Remove the old thing. By now, no running code depends on it.

Let's walk through renaming that column properly. This is a real migration sequence using Drizzle, which is what we use in most of our projects.

// Phase 1: EXPAND — add the new column, keep the old one
// migration: 0042_add_username_column.ts
export async function up(db: NodePgDatabase) {
  await db.execute(sql`
    ALTER TABLE users
    ADD COLUMN username TEXT;
  `);
}

// Phase 2: TRANSITION — deploy code that writes to both columns
// In your application code:
async function updateUser(userId: string, name: string) {
  await db
    .update(users)
    .set({
      user_name: name,  // old column — keep writing here
      username: name,   // new column — start writing here too
    })
    .where(eq(users.id, userId));
}

// Also run a backfill migration to fill gaps:
export async function up(db: NodePgDatabase) {
  await db.execute(sql`
    UPDATE users
    SET username = user_name
    WHERE username IS NULL;
  `);
}

// Phase 3: CONTRACT — remove the old column
// Only after ALL instances are on the new code
// migration: 0044_drop_user_name_column.ts
export async function up(db: NodePgDatabase) {
  await db.execute(sql`
    ALTER TABLE users
    DROP COLUMN user_name;
  `);
}

Yes, this is three separate migration files and probably two deploys. Yes, it takes more time. No, your users will never notice anything happened.

The Migrations That Actually Cause Locks

Not all DDL statements are equal. Some are instant. Some will lock your table for minutes on a large dataset. Knowing which is which saves you from scheduling a maintenance window for something that didn't need one.

On PostgreSQL (which you're probably using), adding a nullable column is instant. The database doesn't need to rewrite anything. Adding a column with a DEFAULT value used to require a full table rewrite in older Postgres versions, but since Postgres 11 it's also instant for most cases. Adding NOT NULL to an existing column without a default? That scans every row. Nightmare on a big table.

-- FAST: Adding nullable column (instant)
ALTER TABLE orders ADD COLUMN notes TEXT;

-- FAST on Postgres 11+: Adding column with default (no rewrite)
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';

-- SLOW: Adding NOT NULL constraint to existing column with data
-- This scans every row
ALTER TABLE orders ALTER COLUMN notes SET NOT NULL;

-- BETTER way to add NOT NULL to existing column:
-- Step 1: Add as nullable
ALTER TABLE orders ADD COLUMN notes TEXT;
-- Step 2: Backfill
UPDATE orders SET notes = '' WHERE notes IS NULL;
-- Step 3: Add constraint (still a scan, but now you control the timing)
ALTER TABLE orders ALTER COLUMN notes SET NOT NULL;

-- EVEN BETTER for huge tables: use NOT VALID first
ALTER TABLE orders ADD CONSTRAINT orders_notes_not_null
  CHECK (notes IS NOT NULL) NOT VALID;
-- Validate separately (doesn't lock writes, just reads)
ALTER TABLE orders VALIDATE CONSTRAINT orders_notes_not_null;

Creating an index is the other big one. A regular CREATE INDEX takes an ACCESS SHARE lock that blocks writes. Always use CREATE INDEX CONCURRENTLY. It takes longer but doesn't block anything.

-- DON'T: Blocks all writes while building the index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- DO: Takes longer but doesn't block writes
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

-- If you're using Drizzle migrations, you can drop down to raw SQL:
// In your migration file:
export async function up(db: NodePgDatabase) {
  // Note: CONCURRENTLY can't run inside a transaction
  // so we need to exit the transaction first
  await db.execute(sql`COMMIT`);
  await db.execute(sql`
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id
    ON orders(user_id);
  `);
}

If your migration file has any ALTER TABLE on a table with more than a few hundred thousand rows, stop and think. You need a plan for that one.

Running Migrations: When and How

There are two schools of thought: run migrations automatically as part of the deploy, or run them manually before deploying. We've tried both. Here's our honest take.

Automatic migrations in a deploy pipeline are great for development and staging. For production, they make us nervous. The problem is that if the migration fails halfway through, you now have a half-migrated database AND a failed deploy. Untangling that at 11pm is not fun. We've been there.

Our current approach: migrations run as a separate step before the deploy, with a clear rollback path. If the migration fails, the deploy doesn't happen. The old code is still running against the old schema, which still works.

// migrate.ts — run this BEFORE deploying new code
import { drizzle } from 'drizzle-orm/node-postgres';
import { migrate } from 'drizzle-orm/node-postgres/migrator';
import { Pool } from 'pg';

async function runMigrations() {
  const pool = new Pool({
    connectionString: process.env.DATABASE_URL,
  });

  const db = drizzle(pool);

  console.log('Running migrations...');

  try {
    await migrate(db, { migrationsFolder: './drizzle' });
    console.log('Migrations complete.');
  } catch (error) {
    console.error('Migration failed:', error);
    process.exit(1); // Exit with error — stops the deploy pipeline
  } finally {
    await pool.end();
  }
}

runMigrations();

In your CI/CD pipeline (GitHub Actions, whatever you're using), you'd run `npx tsx migrate.ts` as a step before the deploy step. If it exits with code 1, the deploy step never runs.

Rollbacks: The Thing Nobody Plans For

Here's the uncomfortable truth: most schema changes can't be automatically rolled back. You can roll back code, but if you dropped a column and need to roll back, that data is gone. This is why the expand/contract pattern matters — during the expand phase, rollback is trivial. You just don't deploy the new code.

Write your down migrations. Even if you never use them, the act of thinking through what it takes to reverse a change forces you to think more carefully about it. Drizzle generates these for you; don't delete them.

For the contract phase (when you're dropping columns or tables), always take a backup snapshot first. On most managed database providers this is one CLI command. We do this religiously now after a fun incident where we dropped a column that turned out to still be referenced by a background job that ran once a week. Seven days of data, gone. The weekly job ran, errored, and we didn't notice for another week.

# Before any destructive migration, snapshot first
# On Supabase:
supabase db dump --db-url "$DATABASE_URL" -f backup_$(date +%Y%m%d_%H%M%S).sql

# On Neon:
neon branches create --name pre-migration-backup

# On Railway, use their backup UI or:
pg_dump "$DATABASE_URL" > backup_$(date +%Y%m%d_%H%M%S).sql

# Then run your migration
npx tsx migrate.ts

# If something goes wrong:
psql "$DATABASE_URL" < backup_20240315_143022.sql

Testing Migrations Before They Hit Production

Your staging environment should have a copy of production data (anonymized). Running migrations on a staging database that has 50 rows when production has 50 million tells you nothing useful about performance. This is a pain to set up but saves you from surprises.

If you're on Neon or PlanetScale, branching makes this stupid easy. Create a branch from production data, run the migration there, check the timing, then run it on production with confidence.

Always test migrations against a dataset close to production size before running them live
Use EXPLAIN ANALYZE on queries after migrations to make sure indexes are being used
Check pg_stat_activity during the migration to see if anything is being blocked
Set a statement_timeout on non-critical migrations so they fail fast instead of locking forever
Add migration checkpoints in your deploy notes — 'migration ran at 14:32, took 8 seconds'

One more thing: run your migrations idempotently where possible. `CREATE TABLE IF NOT EXISTS`, `CREATE INDEX CONCURRENTLY IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`. If a migration gets interrupted and you need to re-run it, idempotent migrations save you from manually cleaning up a partial state.

The Checklist We Actually Use

We have this as a literal checklist in our deploy process for any non-trivial migration. Steal it.

Is this migration reversible? If not, do you have a backup?
Does this migration require the expand/contract pattern (i.e., is it a rename, delete, or type change)?
Are there any long-running queries on this table that need to finish first?
Are you using CREATE INDEX CONCURRENTLY for any new indexes?
Is this migration idempotent?
Did you test this on a staging environment with production-scale data?
Does the old code still work with the new schema (expand phase check)?
Does the new code still work with the old schema (deploy safety check)?

The last two are the most important. Before any deploy, we ask: if we had to run the old code against the new schema for 10 minutes, would anything break? If the answer is yes, the migration isn't ready.

A good migration is invisible. Your users should never be able to tell it happened.

If you're starting a new project and want this whole workflow baked in from day one — Drizzle set up, migration scripts ready, proper database patterns in place — the templates at peal.dev come with this already configured. Easier than setting it up from scratch when you're already trying to build the actual product.

The real takeaway here isn't any specific command or migration tool. It's that schema changes are code changes, and they deserve the same care as your application logic. Probably more, because rollback is a lot harder when the data is already gone. Slow down on migrations. Your future self at 3am will thank you.