Database Migrations in Production: Zero-Downtime Strategies That Actually Work

We once ran a migration that added a NOT NULL column to a table with 2 million rows. Deployed it on a Friday afternoon. The site was down for 11 minutes. We learned a lot that day — mostly about regret, but also about database migrations.

The thing about database migrations is that they look safe in development. You run `db:migrate`, everything works, you move on. Production is a different beast. Your app is getting traffic. Multiple server instances are running. The database is the one piece of infrastructure you can't just restart and hope for the best.

This post covers the patterns we actually use now. Not theoretical stuff — real strategies for Next.js apps with Postgres (Neon, Supabase, RDS, whatever), using Drizzle or Prisma.

Why Migrations Break Production

Before the strategies, you need to understand why naive migrations cause downtime. There are two main culprits: locks and deploy ordering.

When you run ALTER TABLE to add a column, rename something, or add an index, Postgres acquires a lock on that table. Heavy operations can hold that lock for seconds or minutes. Every query touching that table queues up. Your connection pool fills. Requests time out. Users see 500 errors.

The deploy ordering problem is subtler. Say you rename a column from `user_name` to `username`. You deploy the migration first, then the new app code. For the 30 seconds between those two deploys, the old code is still running and querying `user_name` — which no longer exists. Instant errors.

The fundamental rule of zero-downtime migrations: your database must be compatible with both the old version of your code AND the new version, at the same time.

The Expand-Contract Pattern

This is the foundation of everything. Instead of doing dangerous migrations in one step, you split them into phases that are individually safe.

Expand: Add new things (columns, tables, indexes). Never remove anything yet.
Migrate: Backfill data, update application code to use the new structure.
Contract: Remove the old stuff once no code depends on it anymore.

Renaming a column the safe way takes three deploys instead of one. Yes, it's more work. It's also the reason you sleep through the night.

Deploy 1 — Add the new column, keep writing to the old one, start dual-writing:

-- Migration: add new column (safe, no lock issues)
ALTER TABLE users ADD COLUMN username TEXT;

-- Backfill existing data
UPDATE users SET username = user_name WHERE username IS NULL;

Deploy 2 — Update your app to read from the new column, write to both:

// During transition: write to both columns, read from new one
async function updateUsername(userId: string, newUsername: string) {
  await db
    .update(users)
    .set({
      username: newUsername,    // new column
      user_name: newUsername,   // old column (still needed for old code instances)
    })
    .where(eq(users.id, userId));
}

async function getUser(userId: string) {
  const user = await db.query.users.findFirst({
    where: eq(users.id, userId),
  });
  // Read from new column, fall back to old if needed
  return {
    ...user,
    username: user?.username ?? user?.user_name,
  };
}

Deploy 3 — Once you're confident the new column has all data and no code references the old one, drop it:

-- Migration: drop old column (safe, no more code uses it)
ALTER TABLE users DROP COLUMN user_name;

Adding Indexes Without Locking the Table

Regular index creation in Postgres takes an exclusive lock on the table for the entire build duration. On a large table, that could be minutes. The fix is one magic keyword: CONCURRENTLY.

-- BAD: locks the table for the entire index build
CREATE INDEX idx_users_email ON users(email);

-- GOOD: builds in background, allows reads and writes
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

-- Also works for dropping
DROP INDEX CONCURRENTLY idx_users_email;

CONCURRENTLY takes longer to build — sometimes 2-3x — but your table stays fully accessible. The catch: it can't run inside a transaction. So if you're using a migration tool that wraps everything in a transaction (like Prisma does by default), you need to handle this separately.

With Drizzle, you can use the `sql` escape hatch for operations that need to run outside transactions:

// drizzle/migrations/0042_add_email_index.ts
import { sql } from 'drizzle-orm';
import type { MigrationConfig } from 'drizzle-orm/migrator';

export async function up(db: any) {
  // This must run outside a transaction
  await db.execute(
    sql`CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email)`
  );
}

export async function down(db: any) {
  await db.execute(
    sql`DROP INDEX CONCURRENTLY IF EXISTS idx_users_email`
  );
}

Always use CREATE INDEX CONCURRENTLY in production. The performance cost of a slightly slower build is nothing compared to a locked table during peak traffic.

Handling NOT NULL Columns on Existing Tables

Adding a NOT NULL column to a table with existing rows is a classic foot-gun. Postgres needs to verify (or set) the value for every existing row, which takes a lock while it does it. The safe pattern:

Step 1: Add the column as nullable with a default value
Step 2: Backfill all existing rows
Step 3: Add the NOT NULL constraint with VALIDATE deferred
Step 4: Set the final default if needed

-- Step 1: Add nullable column (fast, no row scanning)
ALTER TABLE orders ADD COLUMN status TEXT DEFAULT 'pending';

-- Step 2: Backfill in batches (do this in your app or a separate script)
-- Don't update 2M rows in one query — it holds locks
UPDATE orders 
SET status = 'completed' 
WHERE status IS NULL 
  AND created_at < '2024-01-01'
LIMIT 10000;
-- Run this repeatedly until no rows updated

-- Step 3: Add NOT NULL constraint (Postgres 12+ validates without full lock)
ALTER TABLE orders 
  ADD CONSTRAINT orders_status_not_null 
  CHECK (status IS NOT NULL) 
  NOT VALID;

-- This validates in background, doesn't block writes
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- Step 4: Now you can make it a real NOT NULL (fast because constraint exists)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;

This is more verbose than a single ALTER TABLE, but it's the difference between a 30-second outage and a smooth deploy. The NOT VALID / VALIDATE pattern is particularly powerful in Postgres 12+ because the validation step doesn't block concurrent writes.

Batched Backfills: Don't Update Millions of Rows at Once

When you need to backfill data in a large table, running a single UPDATE on all rows is almost as bad as dropping the table. It holds locks, bloats your WAL, and makes your DBA (if you have one) very sad.

The pattern we use for backfills in Next.js apps is a simple API route or script that updates in small batches with a delay:

// scripts/backfill-user-slugs.ts
import { db } from '@/lib/db';
import { users } from '@/lib/db/schema';
import { isNull, sql } from 'drizzle-orm';

async function backfillUserSlugs() {
  const BATCH_SIZE = 500;
  const DELAY_MS = 100; // be nice to your database
  let totalUpdated = 0;

  while (true) {
    // Grab a batch of users without slugs
    const batch = await db
      .select({ id: users.id, name: users.name })
      .from(users)
      .where(isNull(users.slug))
      .limit(BATCH_SIZE);

    if (batch.length === 0) {
      console.log(`Done. Updated ${totalUpdated} users.`);
      break;
    }

    // Update each one (or do it in a single query with CASE)
    await db.transaction(async (tx) => {
      for (const user of batch) {
        const slug = user.name
          .toLowerCase()
          .replace(/[^a-z0-9]/g, '-')
          .replace(/-+/g, '-');

        await tx
          .update(users)
          .set({ slug })
          .where(sql`${users.id} = ${user.id}`);
      }
    });

    totalUpdated += batch.length;
    console.log(`Updated ${totalUpdated} users...`);

    // Small delay to avoid hammering the database
    await new Promise((resolve) => setTimeout(resolve, DELAY_MS));
  }
}

backfillUserSlugs().catch(console.error);

Run this script independently of your deploy. Start it, let it chug along for a few minutes, then deploy the code that depends on the new data being there. The 100ms delay sounds tiny but it gives the database breathing room between batches.

Migration Tooling: What We Actually Use

Both Prisma and Drizzle have migration runners, but they make different trade-offs for production use.

Prisma wraps migrations in transactions by default, which is safe but means you can't use CONCURRENTLY or other operations that require no transaction. You can work around this with `db.$executeRaw` and careful migration file structure.

Drizzle gives you more control. You can write raw SQL in migration files and decide per-migration whether to use a transaction. This is what we prefer for production systems where you need fine-grained control.

One thing both tools agree on: never run migrations automatically on startup in a multi-instance environment. If you have three server instances restarting simultaneously, three instances will try to run migrations at the same time. Some migration tools handle this with a lock table, but it's still a footgun. Better to run migrations as a separate step in your CI/CD pipeline before rolling out new instances.

# In your GitHub Actions deploy workflow
- name: Run database migrations
  run: npx drizzle-kit migrate
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL }}

# THEN deploy the new app version
- name: Deploy to production  
  run: vercel deploy --prod

Migrations should run once, before new code goes live. Never on app startup, never in parallel across multiple instances.

The Migration Checklist Before Every Deploy

We run through this before every production deploy that touches the database:

Does this migration take a lock? How long could it hold it?
Is this migration backward-compatible? Will old code break if the migration runs first?
Are we adding an index? Is it CONCURRENT?
Are we adding NOT NULL to an existing table? Did we backfill first?
Are we dropping a column? Is there still code that references it?
Can we test this migration on a production-sized dataset first? (Supabase branching and Neon branches are great for this)
Do we have a rollback plan that doesn't involve restoring from backup?

That last one is often ignored. For most migrations, the rollback is another migration that undoes the change. For a backfill, rolling back means setting those values back to NULL. Think it through before you deploy, not after.

If you're building on top of peal.dev templates, all of our starters come with Drizzle set up with sane production defaults — migrations run as a separate step, schema changes are tracked in version control, and we've stubbed out the patterns described here so you're not starting from scratch.

One Last Thing: Test on Real Data Sizes

The Friday-afternoon incident we mentioned at the top? The migration ran fine on our dev database with 200 rows. It ran fine in staging with 5,000 rows. Production had 2 million rows. That NOT NULL column addition took 11 minutes instead of 0.3 seconds.

Most cloud database providers (Neon, Supabase, PlanetScale) have branching or snapshot features now. Before any non-trivial migration, restore a production snapshot to a branch and run the migration against real data. If it takes 20 seconds on the snapshot, it'll take 20 seconds in production. You want to know that number before you deploy, not after.

Migrations are one of those things that feel solved until they aren't. The strategies here aren't clever tricks — they're just what happens when you've been burned enough times to stop taking shortcuts. Run migrations before deploys, use CONCURRENTLY for indexes, expand before you contract, and for the love of everything test on real data sizes. Your 3am self will thank you.