Health Checks and Status Pages: Stop Finding Out Your App Is Down From Users

Here's a fun way to start your Monday: a Slack message from a user saying 'hey is your site down?' followed by you frantically opening your laptop while pretending you already knew. We've been there. Twice. The second time we were at a gas station in Cluj. The third time we had health checks set up and found out 4 minutes after the incident started, fixed it, and the user never knew.

Health checks and status pages feel like boring infrastructure work — the kind of thing you keep pushing to next sprint. But they're genuinely one of the highest ROI things you can add to a production app. Not because they prevent downtime (they don't), but because they compress the time between 'something broke' and 'someone fixed it' from hours to minutes.

What a health check actually does

A health check is just an HTTP endpoint that returns whether your app is healthy. External monitoring services ping it every 30-60 seconds. If it stops responding or returns a non-200, you get an alert. Simple idea, surprisingly easy to mess up.

The mistake most people make is writing a health check that just returns 200 OK unconditionally. That's a heartbeat check, not a health check. It tells you the Node process is alive, not that your app actually works. Your database could be completely unreachable and your 'health check' would still be green.

A real health check verifies the things your app depends on: database connectivity, cache availability, any critical third-party APIs. If any of those are broken, your app is broken — the health check should say so.

Building a proper health check endpoint in Next.js

Here's how we structure health check endpoints. We use a Route Handler in the App Router and actually test our dependencies:

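// app/api/health/route.ts
import { NextResponse } from 'next/server'
import { db } from '@/lib/db'

// Opt out of static caching. On Next.js versions that cache GET Route
// Handlers by default, this response could otherwise be frozen at build time.
export const dynamic = 'force-dynamic'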

type HealthStatus = 'healthy' | 'degraded' | 'unhealthy'

interface ServiceCheck {
  status: HealthStatus
  latency_ms: number
  error?: string
}

interface HealthResponse {
  status: HealthStatus
  timestamp: string
  version: string
  services: {
    database: ServiceCheck
    [key: string]: ServiceCheck
  }
}

async function checkDatabase(): Promise<ServiceCheck> {
  const start = Date.now()
  try {
    // A lightweight query that actually hits the DB
    await db.execute('SELECT 1')
    return {
      status: 'healthy',
      latency_ms: Date.now() - start,
    }
  } catch (err) {
    return {
      status: 'unhealthy',
      latency_ms: Date.now() - start,
      error: err instanceof Error ? err.message : 'Unknown error',
    }
  }
}

export async function GET() {
  const [database] = await Promise.all([
    checkDatabase(),
    // add checkRedis(), checkStripe(), etc. here
  ])

  const services = { database }

  const overallStatus: HealthStatus = Object.values(services).some(
    (s) => s.status === 'unhealthy'
  )
    ? 'unhealthy'
    : Object.values(services).some((s) => s.status === 'degraded')
    ? 'degraded'
    : 'healthy'

  const response: HealthResponse = {
    status: overallStatus,
    timestamp: new Date().toISOString(),
    version: process.env.NEXT_PUBLIC_APP_VERSION ?? 'unknown',
    services,
  }

  return NextResponse.json(response, {
    status: overallStatus === 'unhealthy' ? 503 : 200,
  })
}

The key things here: we return 503 when unhealthy (not 200 with an error message — some monitoring tools only check status codes), we include latency so we can catch slow-but-not-broken states, and we structure it so adding new service checks is trivial.
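
The endpoint above records latency but never acts on it. Here's a minimal sketch of turning a measured latency into a status so that slow-but-not-broken states show up as 'degraded'. The classifyLatency helper and both threshold values are hypothetical, not part of the original endpoint; tune the numbers against your own baseline.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy'

interface LatencyThresholds {
  degradedMs: number // at or above this, report 'degraded'
  unhealthyMs: number // at or above this, report 'unhealthy'
}

// Hypothetical defaults for illustration only.
const DEFAULT_THRESHOLDS: LatencyThresholds = { degradedMs: 500, unhealthyMs: 3000 }

function classifyLatency(
  latencyMs: number,
  thresholds: LatencyThresholds = DEFAULT_THRESHOLDS
): HealthStatus {
  // Check the stricter threshold first so very slow responses are not
  // reported as merely degraded.
  if (latencyMs >= thresholds.unhealthyMs) return 'unhealthy'
  if (latencyMs >= thresholds.degradedMs) return 'degraded'
  return 'healthy'
}
```

In checkDatabase, the success branch could return status: classifyLatency(Date.now() - start) instead of a hard-coded 'healthy'. Because the endpoint only returns 503 for 'unhealthy', a 'degraded' result still answers 200 and shows up in the JSON body rather than paging anyone.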

Never return 200 from a health check when something critical is broken. Many monitoring tools look only at the status code unless you explicitly configure a body match, so a 200 with {status: 'unhealthy'} will not trigger your alerts by default.

Tiered health checks: not everything is equally critical

One pattern we've landed on: separate health checks for different criticality levels. Your database being down is different from your email service being slow. Both matter, but they shouldn't both page you at 3am with the same urgency.

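// app/api/health/deep/route.ts
// More thorough check, so run it less frequently (every 5 min)
import { NextResponse } from 'next/server'
import { db } from '@/lib/db'
import { redis } from '@/lib/redis'

// Always run the checks at request time instead of serving a cached response.
export const dynamic = 'force-dynamic'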

async function checkRedis() {
  const start = Date.now()
  try {
    await redis.ping()
    return { status: 'healthy' as const, latency_ms: Date.now() - start }
  } catch (err) {
    return {
      status: 'degraded' as const,
      latency_ms: Date.now() - start,
      error: err instanceof Error ? err.message : 'Redis unavailable',
    }
  }
}

async function checkDatabaseWritable() {
  const start = Date.now()
  try {
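    // Actually test the write path, not just a SELECT. Upserting one fixed row
    // keeps the table from growing on every ping (this assumes health_pings
    // has a primary key column named id).
    await db.execute(
      'INSERT INTO health_pings (id, pinged_at) VALUES (1, NOW()) ON CONFLICT (id) DO UPDATE SET pinged_at = EXCLUDED.pinged_at'
    )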
    return { status: 'healthy' as const, latency_ms: Date.now() - start }
  } catch (err) {
    return {
      status: 'unhealthy' as const,
      latency_ms: Date.now() - start,
      error: err instanceof Error ? err.message : 'DB write failed',
    }
  }
}

export async function GET() {
  const [database, cache] = await Promise.allSettled([
    checkDatabaseWritable(),
    checkRedis(),
  ])

  const db_result =
    database.status === 'fulfilled'
      ? database.value
      : { status: 'unhealthy' as const, latency_ms: 0, error: 'Check threw' }

  const cache_result =
    cache.status === 'fulfilled'
      ? cache.value
      : { status: 'degraded' as const, latency_ms: 0, error: 'Check threw' }

  const httpStatus = db_result.status === 'unhealthy' ? 503 : 200

  return NextResponse.json(
    { database: db_result, cache: cache_result, timestamp: new Date().toISOString() },
    { status: httpStatus }
  )
}

Note the Promise.allSettled instead of Promise.all: if one check rejects unexpectedly, you still get results from the others instead of the whole endpoint crashing. We learned this after a health check endpoint itself started returning 500s, which is a special kind of embarrassing.
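
Once you have more than two checks, unwrapping each settled result by hand gets repetitive. Here's one way to generalize the pattern, as a sketch: fromSettled and runChecks are hypothetical helper names, and the onReject field (the status to report when a check itself throws) is an assumption modeled on the database/cache split above.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy'

interface ServiceCheck {
  status: HealthStatus
  latency_ms: number
  error?: string
}

interface CheckSpec {
  run: () => Promise<ServiceCheck>
  onReject: HealthStatus // what to report if run() itself throws or rejects
}

// Convert one Promise.allSettled entry into a ServiceCheck.
function fromSettled(
  result: PromiseSettledResult<ServiceCheck>,
  onReject: HealthStatus
): ServiceCheck {
  if (result.status === 'fulfilled') return result.value
  const reason: unknown = result.reason
  return {
    status: onReject,
    latency_ms: 0,
    error: reason instanceof Error ? reason.message : 'Check threw',
  }
}

// Run all checks concurrently. A rejection in one check never hides the others.
async function runChecks(
  checks: Record<string, CheckSpec>
): Promise<Record<string, ServiceCheck>> {
  const entries = Object.entries(checks)
  const settled = await Promise.allSettled(
    // Promise.resolve().then(...) also converts a synchronous throw into a rejection.
    entries.map(([, check]) => Promise.resolve().then(() => check.run()))
  )
  return Object.fromEntries(
    entries.map(([name, check], i) => [name, fromSettled(settled[i], check.onReject)])
  )
}
```

With this in place, the deep endpoint could call runChecks({ database: { run: checkDatabaseWritable, onReject: 'unhealthy' }, cache: { run: checkRedis, onReject: 'degraded' } }) and derive the HTTP status from the result.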

Picking a monitoring service and setting up alerts

There are a bunch of options here. We've used or evaluated most of them:

Better Uptime — our current pick. Clean UI, incident management built in, status page included. Reasonably priced.
UptimeRobot — free tier is genuinely useful. 5-minute check intervals on free, 1-minute on paid. Good for early-stage.
Checkly — more developer-focused, lets you write real Playwright scripts as monitors instead of just pinging URLs. Overkill for most apps but powerful.
Grafana Cloud — if you're already in that ecosystem. Has a free tier for uptime monitoring.
Pinging.net — criminally underrated, extremely simple, very cheap.

Whatever you pick, set up at minimum: an alert to Slack or Discord within 2 minutes of downtime, an SMS or phone call for anything that's been down more than 5 minutes, and separate alerts for your deep health check vs your shallow one. You want to know about database issues fast. You can be slightly more relaxed about cache degradation.
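
Hosted monitoring tools let you configure escalation in their UI, but it can help to write the policy down as data first. The sketch below is a hypothetical in-house representation, not any provider's config format: the EscalationRule shape and channelsDue function are made up for illustration. The chat rule fires as soon as downtime is confirmed, which stays inside the 2-minute target with 30-60 second check intervals, and the SMS rule fires once downtime exceeds 5 minutes.

```typescript
type AlertChannel = 'chat' | 'sms'

interface EscalationRule {
  channel: AlertChannel
  afterSeconds: number // notify once downtime exceeds this many seconds
}

// Thresholds taken from the text above; adjust for your own on-call setup.
const RULES: EscalationRule[] = [
  { channel: 'chat', afterSeconds: 0 }, // Slack or Discord, on confirmed downtime
  { channel: 'sms', afterSeconds: 300 }, // SMS or phone call after 5 minutes
]

// Return every channel that should have been notified by now.
function channelsDue(downSeconds: number, rules: EscalationRule[] = RULES): AlertChannel[] {
  return rules.filter((rule) => downSeconds > rule.afterSeconds).map((rule) => rule.channel)
}
```

A deep-check monitor for something like cache degradation could use its own, more relaxed rule list passed as the second argument.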

Building a status page that people actually trust

A status page has two jobs. First, when things are broken, it stops your support queue from filling up with 'is the site down?' tickets. Second — and this one is underrated — it signals to users that you take reliability seriously. A well-maintained status page is a trust signal.

The simplest approach: use a hosted status page from your monitoring provider. Better Uptime, Instatus, and Statuspage.io all let you create one in 10 minutes and connect it directly to your uptime monitors. When your health check goes red, the status page updates automatically. This is what we recommend for most apps.

But if you want to build your own — maybe you want it at status.yourdomain.com with your own design — here's the minimum viable version using Next.js and a database to store incident history:

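// app/status/page.tsx
// Simple status page that reads from your own DB
import { db } from '@/lib/db'

// Render on every request so the page never serves a build-time snapshot.
export const dynamic = 'force-dynamic'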

type ServiceStatus = 'operational' | 'degraded' | 'outage'

interface Incident {
  id: string
  title: string
  status: 'investigating' | 'identified' | 'monitoring' | 'resolved'
  created_at: Date
  resolved_at: Date | null
  updates: { message: string; created_at: Date }[]
}

async function getCurrentStatus(): Promise<Record<string, ServiceStatus>> {
  // In practice, you'd query a service_statuses table that your
  // monitoring webhook updates automatically
  const statuses = await db.query(
    `SELECT service_name, status 
     FROM service_statuses 
     WHERE updated_at > NOW() - INTERVAL '10 minutes'`
  )
  return Object.fromEntries(statuses.rows.map((r) => [r.service_name, r.status]))
}

async function getRecentIncidents(): Promise<Incident[]> {
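  // The LEFT JOIN yields one all-NULL row for incidents without updates.
  // FILTER drops it so updates comes back as [] instead of [null].
  const incidents = await db.query(
    `SELECT i.*,
       COALESCE(
         json_agg(u ORDER BY u.created_at DESC) FILTER (WHERE u.incident_id IS NOT NULL),
         '[]'::json
       ) AS updates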
     FROM incidents i
     LEFT JOIN incident_updates u ON u.incident_id = i.id
     WHERE i.created_at > NOW() - INTERVAL '90 days'
     GROUP BY i.id
     ORDER BY i.created_at DESC
     LIMIT 10`
  )
  return incidents.rows
}

export default async function StatusPage() {
  const [statuses, incidents] = await Promise.all([
    getCurrentStatus(),
    getRecentIncidents(),
  ])

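  const serviceStatuses = Object.values(statuses)
  // every() is true for an empty array, so treat "no fresh rows" as unknown
  // instead of reporting all-green on missing data.
  const hasData = serviceStatuses.length > 0
  const allOperational = hasData && serviceStatuses.every((s) => s === 'operational')

  return (
    <main className="max-w-2xl mx-auto py-16 px-4">
      <h1 className="text-2xl font-bold mb-2">System Status</h1>

      <div
        className={`rounded-lg p-4 mb-8 ${
          allOperational ? 'bg-green-50 text-green-800' : 'bg-red-50 text-red-800'
        }`}
      >
        {allOperational
          ? '✓ All systems operational'
          : hasData
          ? '⚠ Service disruption detected'
          : '⚠ Status data unavailable'}
      </div>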

      <section className="mb-8">
        <h2 className="text-lg font-semibold mb-4">Services</h2>
        {Object.entries(statuses).map(([service, status]) => (
          <div key={service} className="flex justify-between py-3 border-b">
            <span className="capitalize">{service.replace(/_/g, ' ')}</span>
            <span
              className={`text-sm font-medium ${
                status === 'operational'
                  ? 'text-green-600'
                  : status === 'degraded'
                  ? 'text-yellow-600'
                  : 'text-red-600'
              }`}
            >
              {status}
            </span>
          </div>
        ))}
      </section>

      {incidents.length > 0 && (
        <section>
          <h2 className="text-lg font-semibold mb-4">Recent Incidents</h2>
          {incidents.map((incident) => (
            <div key={incident.id} className="mb-6 border rounded-lg p-4">
              <div className="flex justify-between mb-2">
                <h3 className="font-medium">{incident.title}</h3>
                <span className="text-sm text-gray-500">
                  {incident.resolved_at ? 'Resolved' : incident.status}
                </span>
              </div>
              {incident.updates?.map((update, i) => (
                <div key={i} className="text-sm text-gray-600 mt-2">
                  <span className="text-gray-400">
                    {new Date(update.created_at).toLocaleString()} —
                  </span>{' '}
                  {update.message}
                </div>
              ))}
            </div>
          ))}
        </section>
      )}
    </main>
  )
}

One thing worth setting up: a webhook from your monitoring provider that automatically updates your database when monitors go up or down. Better Uptime supports this natively. That way, your status page reflects reality without you manually logging in during an incident.
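
Here's a rough sketch of the mapping step inside such a webhook handler. The payload shape is entirely hypothetical: every provider defines its own webhook schema, so check your provider's docs and adapt the field names. The toStatusUpdate function and MonitorEvent type are made up for illustration.

```typescript
type ServiceStatus = 'operational' | 'degraded' | 'outage'

// Hypothetical payload shape. Real providers use their own field names.
interface MonitorEvent {
  monitor: string // which service the monitor watches, e.g. 'database'
  event: 'down' | 'up' | 'slow'
}

interface StatusUpdate {
  service_name: string
  status: ServiceStatus
}

// Validate an untrusted webhook body and map it to a row for service_statuses.
// Returns null for anything unrecognized so the handler can answer 400.
function toStatusUpdate(payload: unknown): StatusUpdate | null {
  if (typeof payload !== 'object' || payload === null) return null
  const { monitor, event } = payload as Partial<MonitorEvent>
  if (typeof monitor !== 'string' || monitor.length === 0) return null
  switch (event) {
    case 'down':
      return { service_name: monitor, status: 'outage' }
    case 'slow':
      return { service_name: monitor, status: 'degraded' }
    case 'up':
      return { service_name: monitor, status: 'operational' }
    default:
      return null
  }
}

// In the Route Handler you would then upsert the row, for example
// (assuming a unique constraint on service_name):
//   INSERT INTO service_statuses (service_name, status, updated_at)
//   VALUES ($1, $2, NOW())
//   ON CONFLICT (service_name)
//   DO UPDATE SET status = EXCLUDED.status, updated_at = NOW()
```

Two things to watch. Verify the webhook with a shared secret before trusting it, or anyone can flip your status page. And since getCurrentStatus only reads rows updated in the last 10 minutes, a webhook that fires only on state changes will let rows age out of that window, so either refresh updated_at on a schedule or widen the interval.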

The incident communication part nobody talks about

Technical monitoring is the easy part. The hard part is communicating well when things break. Here's the actual pattern that works:

Post on your status page within 5 minutes of detecting an issue, even if you know nothing yet. 'We are investigating reports of elevated error rates' is better than silence.
Update every 20-30 minutes while investigating, even to say 'still investigating, no update yet'. The uncertainty is the worst part for users.
When resolved, write a brief post-mortem on the status page. What broke, why, what you changed to prevent it. Users respect this enormously.
Email your paying customers for any outage over 30 minutes. Don't make them find the status page — come to them.

A 2-hour outage with good communication hurts less than a 20-minute outage with silence. Users can handle things breaking. What they can't handle is not knowing if you know.

Don't forget to protect your health check endpoint

Two security things that often get missed. First, your health check endpoint might leak internal information — error messages, service names, infrastructure details. Consider having two versions: a public one that just returns healthy/unhealthy, and an authenticated one with the full detail for your monitoring service to use.
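
One way to build the public version is to derive it from the detailed result with an allow-list, so new fields never leak by accident. A minimal sketch, where toPublicHealth and the two interface names are hypothetical:

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy'

interface ServiceCheck {
  status: HealthStatus
  latency_ms: number
  error?: string
}

interface DetailedHealth {
  status: HealthStatus
  timestamp: string
  services: Record<string, ServiceCheck>
}

interface PublicHealth {
  status: HealthStatus
  timestamp: string
}

// Copy only the fields that are safe to expose. Building a new object from an
// allow-list is safer than deleting known-sensitive keys, because anything
// added to DetailedHealth later stays private by default.
function toPublicHealth(detailed: DetailedHealth): PublicHealth {
  return { status: detailed.status, timestamp: detailed.timestamp }
}
```

The public route can still run the real checks and return 503 when unhealthy. It just doesn't say which service failed or why.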

Second, health checks can be a minor DDoS vector if they're expensive and you expose them publicly. Keep them lightweight. If your database check involves a real query, make sure it's indexed and fast — you're hitting this endpoint every 30 seconds indefinitely. We had a health check query that was doing a full table scan for a few weeks before we noticed it in our slow query logs. Not our finest hour.
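
One option for keeping a public endpoint cheap is to cache each check's result for a few seconds, so a burst of requests triggers at most one real query. This is a sketch, not something the endpoints above already do: cached is a hypothetical helper, and the injectable now parameter exists only to make it testable.

```typescript
// Wrap an async check so repeated calls within ttlMs reuse the same result.
function cached<T>(
  check: () => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now
): () => Promise<T> {
  let entry: { value: Promise<T>; expiresAt: number } | null = null

  return () => {
    const t = now()
    if (entry !== null && t < entry.expiresAt) return entry.value

    // Cache the promise itself, so concurrent callers share one in-flight check.
    const value = check()
    const fresh = { value, expiresAt: t + ttlMs }
    entry = fresh

    // If the check rejects, drop it from the cache so the next call retries.
    value.catch(() => {
      if (entry === fresh) entry = null
    })
    return value
  }
}
```

Usage would look like const checkDatabaseCached = cached(checkDatabase, 5000). Keep the TTL well below your monitor's check interval, otherwise the monitor only ever sees stale results. Note that the cache lives in process memory, so on serverless platforms each instance keeps its own copy.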

// Protecting your detailed health check with a secret token
// app/api/health/detailed/route.ts
import { NextRequest, NextResponse } from 'next/server'

export async function GET(req: NextRequest) {
  const token = req.headers.get('x-health-token')
  
  if (token !== process.env.HEALTH_CHECK_SECRET) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 })
  }

  // ... run full checks and return detail
  return NextResponse.json({ status: 'healthy', services: {} })
}

// In your monitoring tool, add this header to the check:
// X-Health-Token: your-secret-here

The public health check at /api/health stays simple and unauthenticated (monitoring tools that just check status codes can use it). The detailed one at /api/health/detailed is only accessible to your monitoring service.
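
If you want to harden the token check, a plain !== comparison can in principle leak timing information, and it's worth failing closed when the secret is unset. Here's a sketch using Node's crypto.timingSafeEqual; tokensMatch is a hypothetical helper name, and this assumes the route runs on the Node.js runtime rather than the Edge runtime.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto'

// Compare a provided token against the expected secret in constant time.
function tokensMatch(provided: string | null, expected: string | undefined): boolean {
  // Fail closed: a missing header or an unset/empty secret never authenticates.
  if (!provided || !expected) return false

  // timingSafeEqual requires equal-length inputs, so hash both sides first.
  // SHA-256 digests are always 32 bytes regardless of token length.
  const a = createHash('sha256').update(provided).digest()
  const b = createHash('sha256').update(expected).digest()
  return timingSafeEqual(a, b)
}
```

In the handler, the condition would become if (!tokensMatch(req.headers.get('x-health-token'), process.env.HEALTH_CHECK_SECRET)) followed by the same 401 response.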

Putting it all together

If you're starting from scratch, here's the 2-hour implementation plan: add a basic health check endpoint that checks your database, set up a free UptimeRobot account to ping it every 5 minutes with an email alert, and create a free status page on Instatus or Better Uptime. That's the minimum. You're now in the 'knows when things break' club.

When you want to go deeper: add more service checks, set up proper Slack/SMS alerting, build or buy a proper status page at status.yourdomain.com, and define your incident communication process before you need it (not during).

Most of the Next.js templates on peal.dev come with a basic health check endpoint already wired up — it's one of those things that's easy to include from the start and annoying to add later when you're already in production.

The gas station incident was a payment processing bug that brought down checkout for about 40 minutes. We found out from a user. We didn't have monitoring. That was the last time. The next production incident — database connection pool exhaustion, two months later — we got an alert at 2:07am, fixed it by 2:24am, and nobody ever knew. That's the difference. Worth the 2 hours of setup.