Your App Is Down and You Found Out From a User — Fix That

There's a specific kind of humiliation reserved for indie developers: you're having breakfast, feeling good about life, and then a user emails you 'hey, your site has been down for 3 hours.' We've been there. It's not great. The app was down, we didn't know, and someone who paid us money had to be the one to tell us.

Health checks and status pages are one of those things that feel optional until they're not. You build the feature, you ship it, and monitoring is 'something you'll add later.' Later arrives at 2am when everything is on fire and you're debugging in production with zero visibility. This post is about not being in that situation.

What a Health Check Actually Is

A health check is just an endpoint your app exposes that external tools can ping to see if everything's working. At its simplest, it's a route that returns 200 OK. At its most useful, it actually checks whether the things your app depends on — database, cache, third-party APIs — are also working.

The dumb version looks like this:

// app/api/health/route.ts
export async function GET() {
  return Response.json({ status: 'ok' });
}

This tells you your Next.js server is running and can handle requests. That's it. It won't tell you your database is down, your Redis cache is failing, or that the third-party email service you use just went dark. It's better than nothing, but only barely.

The useful version actually checks your dependencies:

// app/api/health/route.ts
import { db } from '@/lib/db';
import { sql } from 'drizzle-orm';

type HealthStatus = 'ok' | 'degraded' | 'down';

interface ComponentHealth {
  status: HealthStatus;
  latencyMs?: number;
  error?: string;
}

interface HealthResponse {
  status: HealthStatus;
  timestamp: string;
  components: {
    database: ComponentHealth;
  };
}

async function checkDatabase(): Promise<ComponentHealth> {
  const start = Date.now();
  try {
    await db.execute(sql`SELECT 1`);
    return {
      status: 'ok',
      latencyMs: Date.now() - start,
    };
  } catch (error) {
    return {
      status: 'down',
      error: error instanceof Error ? error.message : 'Unknown error',
    };
  }
}

export async function GET() {
  const [database] = await Promise.allSettled([checkDatabase()]);

  const dbHealth =
    database.status === 'fulfilled'
      ? database.value
      : { status: 'down' as HealthStatus, error: 'Check failed' };

  const overallStatus: HealthStatus =
    dbHealth.status === 'down' ? 'down' : 'ok';

  const body: HealthResponse = {
    status: overallStatus,
    timestamp: new Date().toISOString(),
    components: {
      database: dbHealth,
    },
  };

  return Response.json(body, {
    status: overallStatus === 'down' ? 503 : 200,
  });
}

The HTTP status code matters here. Return 503 when things are broken — not 200 with an error body. Uptime monitors check status codes, not response bodies, unless you configure them to. Keep it simple for the monitor.

What to Check (and What Not To)

The temptation is to check everything from your health endpoint. Don't. Your health check should be fast (under 2 seconds) and check things that, if broken, mean your app literally cannot serve users.

Database connectivity — if your DB is down, you're down
Cache connectivity (Redis/Upstash) — if you use it for sessions, losing it is critical
Any internal service your app directly depends on for core functionality
NOT: third-party APIs like Stripe, SendGrid, or OpenAI — their outage ≠ your app is down

We made the mistake of checking Stripe's API in our health endpoint early on. Stripe had a minor incident, our health check started returning 503, and our uptime monitor sent us 47 alerts. The app was completely fine — users could browse, log in, do everything. But we were getting paged like it was the apocalypse. Check your own infrastructure, not your vendors'.

A health check should answer one question: 'Can my app serve a user request right now?' If the answer is no, return 503. If yes, return 200. Don't overthink it.
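
One way to keep the endpoint inside that time budget is to race each dependency check against a timer, and to let only the truly critical components pull the overall status to 'down'. The sketch below is a minimal illustration, not part of the route shown earlier: `withTimeout`, `overallStatus`, and the `critical` flag are hypothetical names we made up for this example.

```typescript
type HealthStatus = 'ok' | 'degraded' | 'down';

interface ComponentHealth {
  status: HealthStatus;
  latencyMs?: number;
  error?: string;
}

// Hypothetical helper: race a check against a timer so one hanging
// dependency cannot hold the whole health endpoint open.
async function withTimeout(
  check: () => Promise<ComponentHealth>,
  timeoutMs: number,
): Promise<ComponentHealth> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<ComponentHealth>((resolve) => {
    timer = setTimeout(
      () => resolve({ status: 'down', error: `Timed out after ${timeoutMs}ms` }),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([check(), timedOut]);
  } catch (error) {
    // A check that throws is reported as down instead of crashing the route
    return {
      status: 'down',
      error: error instanceof Error ? error.message : 'Unknown error',
    };
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical aggregation: only a critical component being down makes the
// whole app 'down'; any other non-ok component makes it 'degraded'.
function overallStatus(
  results: Array<{ critical: boolean; health: ComponentHealth }>,
): HealthStatus {
  if (results.some((r) => r.critical && r.health.status === 'down')) return 'down';
  if (results.some((r) => r.health.status !== 'ok')) return 'degraded';
  return 'ok';
}
```

In the route above, this would look something like `withTimeout(checkDatabase, 1500)` in place of the bare `checkDatabase()` call. Note that the timer only stops the endpoint from waiting; the underlying query keeps running until the driver gives up on it.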

Setting Up Uptime Monitoring

Once you have a health endpoint, you need something outside your infrastructure pinging it. This is important — if your server is on fire, you can't rely on your server to tell you it's on fire.

Options we've used:

Better Uptime — clean UI, good alerting, free tier is reasonable
UptimeRobot — the old reliable, free tier pings every 5 minutes
Checkly — more powerful if you want browser-level checks, not just HTTP
Pulsetic — simple, good for small projects

Configure it to check your /api/health endpoint every 1-5 minutes. Set up alerts via email, Slack, or SMS — whatever you'll actually notice at 2am. A 5-minute detection window means in the worst case you're down 5 minutes before you know. That's acceptable for most indie products. If you need faster detection, you're probably at a scale where this post isn't your primary concern.
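
The worst-case arithmetic is simple enough to write down. This is a rough sketch under assumptions of ours: the helper name is hypothetical, and it assumes a monitor that only alerts after a configurable number of consecutive failed checks, each spaced one regular interval apart. Check how your own monitor handles retries before trusting the number.

```typescript
// Hypothetical helper: worst-case seconds between an outage starting
// and the first alert being sent.
function worstCaseDetectionSeconds(
  intervalSec: number,
  failuresBeforeAlert = 1,
  requestTimeoutSec = 0,
): number {
  // Worst case: the outage starts just after a successful check, so a full
  // interval passes before each failed check that counts toward the alert.
  // requestTimeoutSec covers a final check that hangs before it is marked failed.
  return intervalSec * failuresBeforeAlert + requestTimeoutSec;
}
```

With a 5-minute interval and an alert on the first failure, `worstCaseDetectionSeconds(300)` gives 300 seconds, which is the 5-minute window described above. Requiring a second confirmation doubles it.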

One thing that catches people: make sure your health endpoint is excluded from authentication middleware. We once accidentally put our health check behind auth, so the uptime monitor couldn't reach it and kept alerting 'app is down' when the app was perfectly fine. Fun debugging session at the airport.

// middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

export function middleware(request: NextRequest) {
  // Skip auth for health checks
  if (request.nextUrl.pathname === '/api/health') {
    return NextResponse.next();
  }

  // Your normal auth logic here
  // ...
}

export const config = {
  matcher: [
    '/((?!_next/static|_next/image|favicon.ico).*)',
  ],
};

Building a Status Page

A status page is where you tell your users what's going on when things break. Here's the thing though: your status page needs to be hosted somewhere separate from your main app. If your app is down and your status page is also your app, you have a problem.

Hosted status page options:

Better Uptime's built-in status pages — free, quick to set up, looks professional
Statuspage.io (by Atlassian) — the industry standard, but expensive
Instatus — good middle ground, more affordable than Statuspage
A static page on Cloudflare Pages — janky but independent from your infra

If you want to build your own status page and you're willing to accept that it might be down when your app is down (fine for small projects, not fine for SaaS), here's a simple approach using your health check data:

// app/status/page.tsx
async function getHealthData() {
  try {
    // Use absolute URL in server components
    const res = await fetch(
      `${process.env.NEXT_PUBLIC_APP_URL}/api/health`,
      {
        // Don't cache this — always fresh
        cache: 'no-store',
        // Short timeout so a hanging health check
        // doesn't make your status page hang too
        signal: AbortSignal.timeout(5000),
      }
    );
    // await here so a non-JSON body (e.g. an HTML error page) is caught below
    return await res.json();
  } catch {
    return { status: 'down', components: {} };
  }
}

export default async function StatusPage() {
  const health = await getHealthData();

  const statusConfig = {
    ok: { label: 'All Systems Operational', color: 'text-green-600', bg: 'bg-green-50', dot: 'bg-green-500' },
    degraded: { label: 'Partial Outage', color: 'text-yellow-600', bg: 'bg-yellow-50', dot: 'bg-yellow-500' },
    down: { label: 'Major Outage', color: 'text-red-600', bg: 'bg-red-50', dot: 'bg-red-500' },
  };

  const config = statusConfig[health.status as keyof typeof statusConfig] ?? statusConfig.down;

  return (
    <div className="min-h-screen bg-gray-50 py-12">
      <div className="max-w-2xl mx-auto px-4">
        <h1 className="text-2xl font-bold text-gray-900 mb-8">System Status</h1>

        <div className={`rounded-lg p-6 mb-8 ${config.bg}`}>
          <div className="flex items-center gap-3">
            <span className={`w-3 h-3 rounded-full ${config.dot}`} />
            <span className={`font-semibold ${config.color}`}>{config.label}</span>
          </div>
        </div>

        <div className="bg-white rounded-lg border border-gray-200 divide-y divide-gray-200">
          {Object.entries(health.components ?? {}).map(([name, component]: [string, any]) => (
            <div key={name} className="flex items-center justify-between p-4">
              <span className="text-gray-700 capitalize">{name}</span>
              <div className="flex items-center gap-2">
                {component.latencyMs != null && (
                  <span className="text-sm text-gray-400">{component.latencyMs}ms</span>
                )}
                <span
                  className={`text-sm font-medium ${
                    component.status === 'ok' ? 'text-green-600' : 'text-red-600'
                  }`}
                >
                  {component.status === 'ok'
                    ? 'Operational'
                    : component.status === 'degraded'
                      ? 'Degraded'
                      : 'Down'}
                </span>
              </div>
            </div>
          ))}
        </div>

        <p className="text-sm text-gray-400 mt-6 text-center">
          Last updated: {new Date().toLocaleString()}
        </p>
      </div>
    </div>
  );
}

Incident Communication Is Half the Battle

When something goes wrong, the technical fix is often the easy part. What kills user trust is silence. People can handle outages — they've all used software long enough to know things break. What they can't handle is not knowing what's happening or when it'll be fixed.

Our process when something goes down:

First 5 minutes: Acknowledge the incident on the status page, even if you don't know the cause yet
Every 20-30 minutes: Post an update, even if it's just 'still investigating'
When resolved: Write a brief post-mortem — what happened, why, what you're doing to prevent it
Don't delete the incident history — it builds trust, not the opposite

We had a database incident last year where our connection pool got exhausted and queries started timing out. We had a status page update live within 8 minutes of detecting it. Several users replied to say they appreciated the quick communication. The incident itself was bad; how we handled it made a real difference.

An honest 'we're investigating, here's what we know' posted in 5 minutes does more for user trust than a detailed post-mortem posted 2 hours later.

Adding Latency Tracking to Your Health Check

Being 'up' and being 'fast' are different things. Your app can technically return 200 OK while your database is taking 8 seconds per query and your users are rage-quitting. For this reason it's worth tracking latency in your health responses and alerting when things get slow, not just when they go down completely.

Most uptime monitors let you set a 'response time threshold' — if your /api/health takes more than X milliseconds, trigger an alert. Set this to something like 3000ms for a degraded alert. It gives you early warning before a slowdown turns into a full outage.
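
The same idea can live inside the endpoint itself: a component that answers, but slowly, gets reported as 'degraded' instead of 'ok'. Here's a minimal sketch. `applyLatencyThreshold` and the threshold value are hypothetical, and the types simply mirror the ones from the health route earlier.

```typescript
type HealthStatus = 'ok' | 'degraded' | 'down';

interface ComponentHealth {
  status: HealthStatus;
  latencyMs?: number;
  error?: string;
}

// Hypothetical helper: downgrade a healthy-but-slow component to 'degraded'.
// Components that are already down, or have no latency measurement, pass through unchanged.
function applyLatencyThreshold(
  health: ComponentHealth,
  degradedAfterMs: number,
): ComponentHealth {
  const isSlow =
    health.status === 'ok' &&
    health.latencyMs !== undefined &&
    health.latencyMs > degradedAfterMs;
  return isSlow ? { ...health, status: 'degraded' } : health;
}
```

To make use of this, the route's overall status would also need to pass 'degraded' through instead of collapsing everything that isn't 'down' into 'ok'. We'd still return HTTP 200 in that case, since the app can serve requests; the status page already has a 'degraded' entry to display it.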

You can also expose latency metrics in your health response (as shown in the earlier code) and then query those from a simple dashboard. Nothing fancy — even a spreadsheet pulling from your health endpoint every few minutes via a scheduled task would give you a rough idea of database latency trends over time.
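
As a rough sketch of that scheduled task, the script below fetches the health endpoint and prints one CSV line per component, which a scheduler can append to a file that your spreadsheet imports. `toCsvRows` and `logSample` are hypothetical names, the column layout is our own choice, and it assumes a runtime with a global `fetch` (Node 18 or newer).

```typescript
interface ComponentSample {
  status: string;
  latencyMs?: number;
}

interface HealthSample {
  status: string;
  timestamp: string;
  components: Record<string, ComponentSample>;
}

// One CSV line per component:
// timestamp,overallStatus,component,componentStatus,latencyMs
function toCsvRows(sample: HealthSample): string[] {
  return Object.entries(sample.components ?? {}).map(([name, component]) =>
    [
      sample.timestamp,
      sample.status,
      name,
      component.status,
      component.latencyMs ?? '', // empty cell when the check failed before timing
    ].join(','),
  );
}

// Fetch the health endpoint once and print the rows to stdout.
// A 503 response still carries a JSON body, so it is logged too.
async function logSample(healthUrl: string): Promise<void> {
  const res = await fetch(healthUrl, { signal: AbortSignal.timeout(5000) });
  const sample = (await res.json()) as HealthSample;
  for (const row of toCsvRows(sample)) {
    console.log(row);
  }
}
```

Run `logSample` with your own health URL from whatever scheduler you already have and redirect the output to a file. Ideally that job runs somewhere other than the app server, for the same reason the uptime monitor does.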

The Minimal Viable Monitoring Stack

If you're starting from zero and want to get to a reasonable monitoring setup in an afternoon, here's the minimum that matters:

A /api/health endpoint that checks your database and returns 503 on failure
UptimeRobot or Better Uptime monitoring that endpoint every 5 minutes
Email + Slack alert when it goes down
A status page hosted outside your main app (even a free Better Uptime status page works)
Your status page URL in your app footer and support emails so users know where to look

That's it. Seriously. You can add more later (synthetic monitoring, browser-level checks, distributed tracing), but the above will catch most real production incidents and give you somewhere to send users when things break.

For the peal.dev templates, we've started including a basic health check endpoint and status page scaffold out of the box. It's one of those things where the 10 minutes it saves you at launch is worth it, because launch day is already chaotic enough without also setting up monitoring from scratch.

The bottom line: your users will forgive downtime. They won't forgive finding out about it before you did. Set up a health check, set up a monitor, and make sure you're the first to know when your app is having a bad day. Future you — the one who gets an alert at 2am instead of a support email at 9am — will be grateful.