Health Checks and Status Pages — Knowing When Your App Is Down Before Your Users Do

There's a specific kind of shame that comes from a user opening a GitHub issue titled 'Is the site down?' and you, the developer, having to check if your own app is running. We've been there. It was 11am on a Tuesday, not even a dramatic 2am outage — just a quiet death that nobody caught for three hours because we had zero monitoring in place.

Health checks and status pages aren't glamorous. They don't make it into the MVP feature list. But the first time your database connection pool silently maxes out and your app starts returning 500s while you're eating lunch, you'll wish you'd spent two hours on this instead of that third dark mode variant.

What a Health Check Actually Is

A health check is just an endpoint that tells you if your app can do its job. Not just 'is the server running' — that's table stakes. A real health check verifies that your app can talk to the database, reach Redis, connect to whatever third-party services it depends on. The difference matters: your Next.js process can be alive and actively failing every request because Postgres is unreachable.

There are two flavors worth knowing: liveness checks (is the process alive at all?) and readiness checks (is the process ready to serve traffic?). Kubernetes makes this distinction explicit. On simpler setups like Vercel or Railway, you typically just need one endpoint that does a meaningful check and returns a clear status.
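To make the liveness/readiness split concrete, here's a minimal framework-free sketch. The names `livenessResult`, `readinessResult`, and the `DependencyState` shape are our own assumptions for illustration, not anything Next.js or Kubernetes defines; in a real app each function would back its own route.

```typescript
// Hypothetical shapes for illustration; not from any framework.
interface DependencyState {
  name: string
  healthy: boolean
}

interface ProbeResult {
  statusCode: 200 | 503
  body: { status: 'ok' | 'unavailable'; failing: string[] }
}

// Liveness: the process is up and able to answer at all.
// It deliberately checks no dependencies.
function livenessResult(): ProbeResult {
  return { statusCode: 200, body: { status: 'ok', failing: [] } }
}

// Readiness: every required dependency must be healthy
// before this instance should receive traffic.
function readinessResult(deps: DependencyState[]): ProbeResult {
  const failing = deps.filter((d) => !d.healthy).map((d) => d.name)
  return failing.length === 0
    ? { statusCode: 200, body: { status: 'ok', failing } }
    : { statusCode: 503, body: { status: 'unavailable', failing } }
}
```

With a single endpoint, as on Vercel or Railway, the readiness-style check is the one worth exposing.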

// app/api/health/route.ts
import { NextResponse } from 'next/server'
import { db } from '@/lib/db'
import { redis } from '@/lib/redis' // assumed path; point this at your own Redis client

export const runtime = 'nodejs'
export const dynamic = 'force-dynamic'

type HealthStatus = 'ok' | 'degraded' | 'down'

interface ServiceCheck {
  status: HealthStatus
  latencyMs: number
  error?: string
}

async function checkDatabase(): Promise<ServiceCheck> {
  const start = Date.now()
  try {
    await db.execute('SELECT 1')
    return { status: 'ok', latencyMs: Date.now() - start }
  } catch (err) {
    return {
      status: 'down',
      latencyMs: Date.now() - start,
      error: err instanceof Error ? err.message : 'Unknown error',
    }
  }
}

async function checkRedis(): Promise<ServiceCheck> {
  const start = Date.now()
  try {
    // Replace with your actual Redis client
    await redis.ping()
    return { status: 'ok', latencyMs: Date.now() - start }
  } catch (err) {
    return {
      status: 'degraded', // Redis down = degraded, not full outage
      latencyMs: Date.now() - start,
      error: err instanceof Error ? err.message : 'Unknown error',
    }
  }
}

export async function GET() {
  const [database, cache] = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
  ])

  const db_result = database.status === 'fulfilled' ? database.value : { status: 'down' as HealthStatus, latencyMs: 0 }
  const cache_result = cache.status === 'fulfilled' ? cache.value : { status: 'degraded' as HealthStatus, latencyMs: 0 }

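  // The cache check reports 'degraded' (never 'down') on failure,
  // so test for anything other than 'ok'
  const overallStatus: HealthStatus =
    db_result.status === 'down' ? 'down'
    : cache_result.status !== 'ok' ? 'degraded'
    : 'ok'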

  const statusCode = overallStatus === 'down' ? 503 : 200

  return NextResponse.json(
    {
      status: overallStatus,
      timestamp: new Date().toISOString(),
      version: process.env.NEXT_PUBLIC_APP_VERSION ?? 'unknown',
      services: {
        database: db_result,
        cache: cache_result,
      },
    },
    { status: statusCode }
  )
}

Notice we're returning 503 when things are actually down. This is important: most uptime monitors judge health by the HTTP status code by default, not the response body. A health endpoint that returns 200 with a JSON body saying 'status: down' is a troll endpoint. It will fool any monitor that only looks at the status code.

Use Promise.allSettled instead of Promise.all for health checks. You want every check to report even if one throws: Promise.all rejects as soon as any check rejects, and you lose the results of everything else.
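One gap the endpoint above leaves open is a dependency that hangs instead of failing, because Promise.allSettled will wait on it for as long as it takes. Here's a rough sketch of a timeout wrapper that closes it. `withTimeout`, `runChecks`, and the 2-second default budget are assumptions for illustration, not something Next.js provides.

```typescript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    )
    promise.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (err) => { clearTimeout(timer); reject(err) },
    )
  })
}

type CheckOutcome = { name: string; ok: boolean; error?: string }

// Run all named checks in parallel. A slow or failing check
// never hides the results of the others.
async function runChecks(
  checks: Record<string, () => Promise<unknown>>,
  timeoutMs = 2000, // assumed budget; tune per dependency
): Promise<CheckOutcome[]> {
  const entries = Object.entries(checks)
  const settled = await Promise.allSettled(
    entries.map(([name, check]) =>
      // Promise.resolve().then(...) also converts a synchronous throw into a rejection
      withTimeout(Promise.resolve().then(check), timeoutMs, name),
    ),
  )
  return settled.map((result, i) => {
    const name = entries[i][0]
    return result.status === 'fulfilled'
      ? { name, ok: true }
      : {
          name,
          ok: false,
          error: result.reason instanceof Error ? result.reason.message : 'Unknown error',
        }
  })
}
```

The timeout only stops the health check from waiting; it doesn't cancel the underlying query, so it's worth setting a connection timeout on the client itself as well.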

Protecting Your Health Endpoint

Two things to be careful about here. First, don't expose sensitive information. If your database check fails with a connection string error, strip that before sending it to the client. Second, consider whether this endpoint should be public or protected. Public is convenient for monitoring services. Protected is better if you're worried about leaking infrastructure details.

A middle ground: keep the endpoint public but sanitize the error messages. Log the full error server-side, return a generic message to the response.

// Sanitize errors before sending to client
function sanitizeError(err: unknown): string {
  if (!(err instanceof Error)) return 'Check failed'
  
  // Don't leak connection strings, credentials, internal IPs
  const message = err.message
  if (
    message.includes('postgresql://') ||
    message.includes('password') ||
    message.includes('ECONNREFUSED')
  ) {
    return 'Connection failed'
  }
  
  return message
}

// Also add a lightweight secret check if you want
// monitoring services to use a token
export async function GET(request: Request) {
  const token = request.headers.get('x-health-token')
  const isDetailed = token === process.env.HEALTH_CHECK_SECRET
  
  // Run checks...
  
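  return NextResponse.json(
    {
      status: overallStatus,
      // Only include service details for authenticated requests
      ...(isDetailed && { services: { database: db_result } }),
    },
    // Keep the 503 so uptime monitors still see the outage
    { status: overallStatus === 'down' ? 503 : 200 }
  )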
}
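The `===` comparison above is fine for a low-stakes token, but if you'd rather not leak timing information or accidentally grant access when the secret is unset, a constant-time check is a small upgrade. A sketch, assuming the Node.js runtime; `isValidHealthToken` is a hypothetical helper name.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto'

// Compare a presented token against the expected secret in constant time.
// Hashing both sides first yields equal-length buffers, which
// timingSafeEqual requires (it throws on a length mismatch).
function isValidHealthToken(
  presented: string | null,
  expected: string | undefined,
): boolean {
  // Fail closed: no configured secret or no header means no detailed access.
  if (!presented || !expected) return false
  const a = createHash('sha256').update(presented).digest()
  const b = createHash('sha256').update(expected).digest()
  return timingSafeEqual(a, b)
}
```

In the handler, this would replace the direct comparison: `const isDetailed = isValidHealthToken(request.headers.get('x-health-token'), process.env.HEALTH_CHECK_SECRET)`.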

Uptime Monitoring — Who Pings Your Health Check?

Your health endpoint is useless if nothing calls it. You need an external service pinging it every minute (or every 30 seconds if you're paranoid) from outside your infrastructure. The key word is external — a monitor running on the same server that goes down with your app is just performance art.
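If you're curious what those services do on every ping, it's roughly this. Treat it as a minimal sketch and not a replacement for a hosted monitor: `probe`, `FetchLike`, and the 5-second timeout are names and values we made up for illustration.

```typescript
type ProbeOutcome =
  | { up: true; statusCode: number; latencyMs: number }
  | { up: false; statusCode: number | null; latencyMs: number; reason: string }

// Minimal fetch-like signature so the probe can be exercised with a fake.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ status: number }>

async function probe(
  url: string,
  fetchImpl: FetchLike,
  timeoutMs = 5000, // assumed; a hung app should count as down
): Promise<ProbeOutcome> {
  const start = Date.now()
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    const res = await fetchImpl(url, { signal: controller.signal })
    const latencyMs = Date.now() - start
    // Judge by status code alone; the response body is never read.
    return res.status >= 200 && res.status < 300
      ? { up: true, statusCode: res.status, latencyMs }
      : { up: false, statusCode: res.status, latencyMs, reason: `HTTP ${res.status}` }
  } catch (err) {
    // Network errors and aborted (timed out) requests land here.
    return {
      up: false,
      statusCode: null,
      latencyMs: Date.now() - start,
      reason: err instanceof Error ? err.message : 'Request failed',
    }
  } finally {
    clearTimeout(timer)
  }
}
```

Passing the global `fetch` as `fetchImpl` should work on Node 18+; the injected parameter is only there so the logic can be exercised without a network.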

Tools we've actually used and can recommend:

Better Stack (formerly Better Uptime) — clean UI, generous free tier, incident timeline is excellent
UptimeRobot — the classic choice, free for up to 50 monitors, gets the job done
Checkly — if you want to go further and run Playwright checks against real user flows, not just HTTP pings
Grafana Cloud — overkill for most, but if you're already doing metrics, the uptime checks are built in

Set up alerts to go somewhere you'll actually see them. Email works if you check email. A Slack channel dedicated to alerts works better. What doesn't work is a PagerDuty integration you connected once and never finished configuring (ask us how we know).

Building a Status Page (Without Third-Party Hostages)

A status page serves a different purpose than internal monitoring. It's for your users: a place they can check when something feels off. The irony of status pages is that they need to work when your app is down. Hosting your status page on the same infrastructure as your app is like keeping your spare tire in a trunk that jams whenever the car breaks down.

The pragmatic options:

statuspage.io — Atlassian product, solid but pricey once you scale
Instatus — cheaper, does the job, good enough for indie devs and small SaaS
Better Stack's Status Pages — same tool as your monitoring, status page included
Roll your own on a separate domain/CDN — more control, more work

If you want to build a minimal status page that pulls from your own data, here's the pattern we use — a separate Next.js app (or even a static page) deployed independently that polls your health endpoint and displays the result:

// A minimal status page component
// Deploy this on a SEPARATE domain/deployment from your main app
// e.g., status.yourapp.com on Vercel (separate project)

interface StatusData {
  status: 'ok' | 'degraded' | 'down'
  lastChecked: string
  services: Record<string, { status: string; latencyMs: number }>
}

// app/page.tsx on your status subdomain
export default async function StatusPage() {
  let data: StatusData | null = null
  let fetchError = false

  try {
    const res = await fetch('https://yourapp.com/api/health', {
      headers: { 'x-health-token': process.env.HEALTH_CHECK_SECRET! },
      next: { revalidate: 60 }, // Revalidate every minute
    })
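    // The health endpoint returns `timestamp`; map it to the field this page renders
    const body = await res.json()
    data = { ...body, lastChecked: body.timestamp }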
  } catch {
    fetchError = true
  }

  const statusConfig = {
    ok: { label: 'All Systems Operational', color: 'bg-green-500', textColor: 'text-green-700' },
    degraded: { label: 'Partial Outage', color: 'bg-yellow-500', textColor: 'text-yellow-700' },
    down: { label: 'Major Outage', color: 'bg-red-500', textColor: 'text-red-700' },
  }

  const current = fetchError
    ? statusConfig.down
    : statusConfig[data?.status ?? 'down']

  return (
    <main className="max-w-2xl mx-auto py-16 px-4">
      <h1 className="text-2xl font-bold mb-8">System Status</h1>
      <div className={`rounded-lg p-6 mb-8 ${current.color} bg-opacity-10`}>
        <div className="flex items-center gap-3">
          <div className={`w-3 h-3 rounded-full ${current.color}`} />
          <span className={`font-semibold ${current.textColor}`}>
            {current.label}
          </span>
        </div>
      </div>
      {data?.services && (
        <div className="space-y-3">
          {Object.entries(data.services).map(([name, check]) => (
            <div key={name} className="flex items-center justify-between py-3 border-b">
              <span className="capitalize">{name}</span>
              <div className="flex items-center gap-2">
                <span className="text-sm text-gray-500">{check.latencyMs}ms</span>
                <span className={check.status === 'ok' ? 'text-green-600' : 'text-red-600'}>
                  {check.status}
                </span>
              </div>
            </div>
          ))}
        </div>
      )}
      <p className="text-sm text-gray-400 mt-8">
        Last updated: {data?.lastChecked ?? 'unknown'}
      </p>
    </main>
  )
}

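The critical thing here: this status page is a separate Vercel project, pointing at a separate subdomain. When your main app is down because of a bad deploy or a dead database, Vercel still serves this page. (A platform-wide Vercel outage would take out both, so host the status page with a different provider if that worries you.) Your users can check it. You look professional instead of like someone whose apartment is also their office and also on fire.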

Incident History — The Trust-Building Part

A status page that only shows current status is useful. A status page with incident history is a trust-building machine. When a prospective customer or an existing user checks your status page and sees that you had an outage three months ago, documented it clearly, and resolved it — that's more reassuring than claiming 99.9% uptime with no receipts.

For incident history, you have two realistic options: use a managed status page tool that tracks incidents automatically, or maintain a simple incidents table in your database and display it. If you're rolling your own, a simple schema works fine:

CREATE TABLE incidents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  title TEXT NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('investigating', 'identified', 'monitoring', 'resolved')),
  impact TEXT NOT NULL CHECK (impact IN ('minor', 'major', 'critical')),
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  resolved_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE incident_updates (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  incident_id UUID NOT NULL REFERENCES incidents(id) ON DELETE CASCADE,
  message TEXT NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('investigating', 'identified', 'monitoring', 'resolved')),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_incidents_started ON incidents(started_at DESC);
CREATE INDEX idx_incident_updates_incident ON incident_updates(incident_id, created_at DESC);

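Then build a simple admin page (protected by your auth) where you can create incidents and post updates. The status page queries this table alongside the live health check. When something goes wrong, you post an update and users see it immediately. Keep this table in a database that doesn't share fate with your main one, or the incident history disappears exactly when you need it.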
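On the display side, the status page mostly needs to join incidents with their updates and sort everything newest first. A sketch of that shaping step, assuming the rows have already been fetched; the camelCased interfaces and `buildIncidentHistory` are illustrative names, not part of any library.

```typescript
type IncidentStatus = 'investigating' | 'identified' | 'monitoring' | 'resolved'
type IncidentImpact = 'minor' | 'major' | 'critical'

// Row shapes mirroring the two tables above (camelCased for TypeScript).
interface Incident {
  id: string
  title: string
  status: IncidentStatus
  impact: IncidentImpact
  startedAt: Date
  resolvedAt: Date | null
}

interface IncidentUpdate {
  incidentId: string
  message: string
  status: IncidentStatus
  createdAt: Date
}

interface IncidentView extends Incident {
  durationMinutes: number | null // null while the incident is still open
  updates: IncidentUpdate[] // newest first
}

// Attach each incident's updates and order everything newest first.
function buildIncidentHistory(
  incidents: Incident[],
  updates: IncidentUpdate[],
): IncidentView[] {
  return [...incidents]
    .sort((a, b) => b.startedAt.getTime() - a.startedAt.getTime())
    .map((incident) => ({
      ...incident,
      durationMinutes: incident.resolvedAt
        ? Math.round(
            (incident.resolvedAt.getTime() - incident.startedAt.getTime()) / 60000,
          )
        : null,
      updates: updates
        .filter((u) => u.incidentId === incident.id)
        .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime()),
    }))
}
```

Showing the duration next to each resolved incident is a small touch, but it's the number most readers look for.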

The Alerting Stack — Getting Notified Without Going Crazy

One mistake we made early on: alerting on everything. Every blip, every 5xx spike, every deployment moment where traffic briefly hiccups. Within a week we were ignoring alerts the same way people ignore car alarms. Alert fatigue is real and it kills the whole point.

The hierarchy that actually works:

P0 (page immediately): Health check returning 503, sustained 5xx rate above 10% for 2+ minutes
P1 (Slack notification): Response time degraded significantly, database latency spike, partial service issues
P2 (daily digest): Error rate uptick, slower-than-usual queries, things to watch
Don't alert on: individual failed requests, expected errors (404s from bots), deployment restarts
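The hierarchy above can be written down as a small classifier. This is a sketch under stated assumptions: the P0 thresholds come straight from the list, but "degraded significantly" and "uptick" aren't quantified, so the 3x-baseline latency rule and the 1% error rate below are placeholder values you'd tune. `AlertSignal` and `classify` are hypothetical names.

```typescript
type Severity = 'P0' | 'P1' | 'P2' | 'none'

interface AlertSignal {
  healthStatusCode: number // latest status code from the health endpoint
  errorRate5xx: number // fraction of responses that were 5xx, 0..1
  errorRateSustainedSeconds: number // how long the 5xx rate has held
  p95LatencyMs: number
  baselineP95LatencyMs: number
}

function classify(signal: AlertSignal): Severity {
  // P0: health check returning 503, or 5xx rate above 10% for 2+ minutes.
  if (signal.healthStatusCode === 503) return 'P0'
  if (signal.errorRate5xx > 0.1 && signal.errorRateSustainedSeconds >= 120) {
    return 'P0'
  }
  // P1: significant latency degradation (assumed: more than 3x baseline).
  if (signal.p95LatencyMs > 3 * signal.baselineP95LatencyMs) return 'P1'
  // P2: error rate uptick worth a look in the digest (assumed: above 1%).
  if (signal.errorRate5xx > 0.01) return 'P2'
  // Everything else, including isolated failures, stays silent.
  return 'none'
}
```

Note that a brief 5xx spike that doesn't last two minutes lands in the daily digest instead of paging anyone, which is the whole point of the hierarchy.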

Most monitoring tools let you configure alert thresholds and cooldown periods. Use them. A monitor that fires once and then waits five minutes before firing again prevents your phone from exploding during an incident when you're already aware of the problem and trying to fix it.
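If you ever route alerts through your own code, say a webhook handler that forwards to Slack, the cooldown logic is only a few lines. A minimal in-memory sketch; `AlertCooldown` is a hypothetical class, and a real deployment with more than one instance would need shared storage instead of a Map.

```typescript
// Tracks the last notification time per alert key and
// suppresses repeats that fall inside the cooldown window.
class AlertCooldown {
  private lastSent = new Map<string, number>()
  private cooldownMs: number

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs
  }

  // Returns true if a notification should go out now, and records it.
  // `now` is injectable so the behavior can be tested without waiting.
  shouldNotify(key: string, now: number = Date.now()): boolean {
    const last = this.lastSent.get(key)
    if (last !== undefined && now - last < this.cooldownMs) return false
    this.lastSent.set(key, now)
    return true
  }
}
```

Keying by alert name means a database latency alert can still get through while the health check alert is cooling down.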

The goal of alerting is to wake you up when something needs human attention — not to document every hiccup. If you're getting more than 2-3 alerts per week that turn out to be non-issues, your thresholds are wrong.

Tying It All Together

Here's the full picture of what a minimal but solid monitoring setup looks like for a Next.js SaaS:

/api/health endpoint that actually checks database, cache, and critical dependencies
External uptime monitor (Better Stack or UptimeRobot) pinging it every 60 seconds
Alerts going to a Slack channel AND email, with proper cooldown periods
Status page on a separate deployment at status.yourapp.com
Incident history table so you can communicate during outages
A simple runbook (even a Notion doc) with steps for common failure scenarios

That last one — the runbook — is underrated. At 2am when you're half asleep and the database is down, you don't want to be making architectural decisions from scratch. 'Step 1: check Railway logs. Step 2: try connection from local. Step 3: if connection pool exhausted, run this query.' Write it when you're calm so you can follow it when you're not.

Our templates at peal.dev ship with the health check endpoint already wired up — database check, basic error sanitization, the right HTTP status codes. It's one of those things that's easy to forget when you're moving fast and painful to add after the fact.

The real shift in thinking is this: monitoring isn't about knowing when things break. It's about finding out before your users do. That gap — between when your app starts failing and when you find out — is where reputation damage lives. Close that gap to zero and you go from reactive firefighting to actually being in control.

Your users will forgive downtime. They won't forgive finding out about it before you did.