Graceful Degradation: Building Apps That Keep Working When Things Break

At 2am, we once pushed a feature that relied on a third-party enrichment API. By 9am, that API was down. Our entire onboarding flow was broken — not because our code was wrong, but because we'd assumed external services are as reliable as localhost. They're not. Nothing is.

Graceful degradation is the practice of designing your app so that when something fails — an API, a database query, a feature flag service, a payment processor — users still get something useful instead of a white screen and a stack trace. It's not pessimism. It's engineering maturity.

The failure modes nobody talks about in tutorials

Most tutorials show you the happy path. API returns 200, data renders, everyone's happy. Real production traffic looks different. Here are the failure modes we've actually hit:

Third-party API returns 503 for 20 minutes during peak traffic
Database connection pool exhausted under load — queries just hang
Stripe webhook delivery delayed by 30 seconds, so your UI shows the wrong state
Edge function cold start timeout on the first request after deploy
Redis cache evicts your session data under memory pressure
External image CDN goes down, breaking your <Image> components
DNS lookup failure for a service you assumed was always reachable

None of these are your fault. All of them are your problem. The difference between a good app and a fragile one is how you handle each of these situations when they inevitably happen.

Pattern 1: Wrap external calls with timeouts and fallbacks

The most common mistake is calling external services with no timeout. A hung connection isn't the same as a failed connection — it's worse, because it silently blocks your response while users stare at a spinner. Always set explicit timeouts, and always have a fallback.

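// lib/with-timeout.ts
export async function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  fallback: T
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    // Whichever settles first wins. A rejection from `promise` still propagates,
    // and the underlying request is not cancelled, only ignored.
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // don't leave a pending timer once the race is decided
  }
}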

// Usage in a server component or API route
import { withTimeout } from '@/lib/with-timeout';

const userEnrichment = await withTimeout(
  fetchEnrichmentData(user.email),
  2000, // 2 second max — if it takes longer, not worth waiting
  { plan: 'unknown', company: null, role: null } // sensible default
);

// The rest of your code continues with either real data or the fallback
// Users get a working experience either way

Two seconds is usually our threshold for non-critical enrichment data. For anything blocking the main render, it's 500ms. For critical data that must be accurate, you throw and show a proper error — but that's the exception, not the rule.
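Those three tiers can live in one place. The sketch below is one possible shape, not the helper from above: `TimeoutError`, `TIMEOUTS`, `orThrow`, and `orFallback` are hypothetical names we're introducing for illustration. Unlike `withTimeout`, `orFallback` also returns the fallback when the call rejects, which is usually what you want for non-critical data.

```typescript
// Hypothetical helpers; names and structure are illustrative only.
export class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Thresholds from the text: 2s for non-critical data, 500ms for render-blocking data.
export const TIMEOUTS = { nonCritical: 2000, renderBlocking: 500 } as const;

// Critical path: reject with TimeoutError so the caller can show a proper error.
export function orThrow<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Non-critical path: a timeout or a rejection both resolve to the fallback.
export async function orFallback<T>(
  promise: Promise<T>,
  ms: number,
  fallback: T
): Promise<T> {
  try {
    return await orThrow(promise, ms);
  } catch {
    return fallback;
  }
}
```

Neither helper cancels the underlying request. If the call is a `fetch`, passing it an abort signal is the way to actually free the connection.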

Pattern 2: Error boundaries that actually help users

React error boundaries exist, but most people either don't use them or put one giant one at the app root. That's too coarse. A single broken widget shouldn't take down your entire dashboard. Wrap features independently so failures are contained.

// components/feature-boundary.tsx
'use client';

import { Component, ReactNode } from 'react';

interface Props {
  children: ReactNode;
  fallback?: ReactNode;
  featureName?: string;
}

interface State {
  hasError: boolean;
  error?: Error;
}

export class FeatureBoundary extends Component<Props, State> {
  state: State = { hasError: false };

  static getDerivedStateFromError(error: Error): State {
    return { hasError: true, error };
  }

  componentDidCatch(error: Error) {
    // Log to your error tracker but don't crash the page
    console.error(`[${this.props.featureName ?? 'feature'}] crashed:`, error);
    // reportToSentry(error, { feature: this.props.featureName });
  }

  render() {
    if (this.state.hasError) {
      return this.props.fallback ?? (
        <div className="rounded-md border border-dashed p-4 text-sm text-muted-foreground">
          This section is temporarily unavailable.
        </div>
      );
    }
    return this.props.children;
  }
}

// Usage — dashboard with independently failing widgets
export function Dashboard() {
  return (
    <div className="grid grid-cols-3 gap-4">
      <FeatureBoundary featureName="analytics-chart" fallback={<ChartSkeleton />}>
        <AnalyticsChart />
      </FeatureBoundary>

      <FeatureBoundary featureName="recent-activity">
        <RecentActivity />
      </FeatureBoundary>

      <FeatureBoundary featureName="billing-status">
        <BillingStatus />
      </FeatureBoundary>
    </div>
  );
}

Now when your analytics provider has an outage, users see a skeleton instead of a broken dashboard. The billing widget still works. The activity feed still works. The experience degrades gracefully rather than collapsing entirely.

Pattern 3: Stale-while-revalidate for data freshness

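Next.js extends fetch with time-based revalidation, which behaves like stale-while-revalidate, and it's genuinely one of the most useful patterns for resilience. Cached data is served instantly, and once it's older than the revalidate window the next request triggers a refresh in the background. If that revalidation fails, users still see the stale data instead of an error. The very first request has nothing cached, so it still needs its own fallback.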

// Fetch with stale-while-revalidate — serves cached data even if upstream is down
const response = await fetch('https://api.example.com/pricing', {
  next: {
    revalidate: 300, // revalidate every 5 minutes
  },
});

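// For database queries, you can implement the same pattern manually
import { unstable_cache } from 'next/cache';
import { asc, eq } from 'drizzle-orm';
// Assumed paths: point these at your own Drizzle client and schema
import { db } from '@/lib/db';
import { plans } from '@/lib/db/schema';

export const getPricingPlans = unstable_cache(
  async () => {
    // This is the expensive/potentially-failing operation.
    // Return the query directly: a local variable named `plans` would
    // shadow the imported `plans` table used in the filter below.
    return db.query.plans.findMany({
      where: eq(plans.active, true),
      orderBy: asc(plans.price),
    });
  },
  ['pricing-plans'],
  {
    revalidate: 300, // seconds
    tags: ['pricing'],
  }
);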

// In your component — if the DB is temporarily unavailable,
// Next.js serves the cached version from the last successful fetch
export async function PricingSection() {
  const plans = await getPricingPlans();
  return <PricingTable plans={plans} />;
}

Stale data shown confidently is almost always better than no data shown nervously. Pricing plans from 5 minutes ago are fine. An error page is not.
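If you're not on Next.js, or you want to see the mechanics, the same idea fits in a few lines. This is a minimal in-memory sketch under our own assumptions: `createStaleCache` is a hypothetical helper, not a Next.js API, and it only works within a single process.

```typescript
// Hypothetical in-memory helper; not part of Next.js.
interface Entry<T> {
  value: T;
  fetchedAt: number;
}

export function createStaleCache<T>(
  load: () => Promise<T>,
  maxAgeMs: number,
  now: () => number = Date.now // injectable clock, mainly for tests
) {
  let entry: Entry<T> | undefined;
  let refreshing: Promise<void> | undefined;

  async function refresh(): Promise<T> {
    const value = await load();
    entry = { value, fetchedAt: now() };
    return value;
  }

  return async function get(): Promise<T> {
    // Nothing cached yet: the caller has to wait, and a failure propagates.
    if (!entry) return refresh();

    const cached = entry;
    // Cached but old: start one background refresh and keep the old value if it fails.
    if (now() - cached.fetchedAt >= maxAgeMs && !refreshing) {
      refreshing = refresh()
        .then(
          () => undefined,
          () => undefined // upstream is down: keep serving the stale entry
        )
        .finally(() => {
          refreshing = undefined;
        });
    }
    return cached.value;
  };
}
```

Callers never wait on a refresh once something is cached, and a failed refresh just leaves the old entry in place. Concurrent first requests each call `load`, which a real version would probably deduplicate.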

Pattern 4: Circuit breakers for repeated failures

A timeout handles a single slow request. A circuit breaker handles a service that's been down for 10 minutes. Without one, your app keeps hammering a dead API on every request, wasting time and resources. A circuit breaker detects repeated failures and opens the circuit — short-circuiting to the fallback immediately until the service recovers.

// lib/circuit-breaker.ts
interface CircuitBreakerOptions {
  failureThreshold: number;
  resetTimeoutMs: number;
}

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

export class CircuitBreaker {
  private failures = 0;
  private state: CircuitState = 'CLOSED';
  private nextAttempt = 0;

  constructor(private options: CircuitBreakerOptions) {}

  async call<T>(fn: () => Promise<T>, fallback: T): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        // Still in cooldown — return fallback immediately, no network call
        return fallback;
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch {
      this.onFailure();
      return fallback;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  private onFailure() {
    this.failures++;
    if (this.failures >= this.options.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.options.resetTimeoutMs;
    }
  }
}

// Create breakers per service — store them as module-level singletons
// (works per-instance in serverless, but still helps within a single invocation chain)
const enrichmentBreaker = new CircuitBreaker({
  failureThreshold: 3,   // open after 3 failures
  resetTimeoutMs: 60000, // try again after 1 minute
});

export async function getEnrichedUser(email: string) {
  return enrichmentBreaker.call(
    () => fetchEnrichmentData(email),
    { plan: 'unknown', company: null, role: null } // fallback value, same shape as the timeout example
  );
}

In serverless environments, circuit breaker state lives in each instance's memory, so it isn't shared between instances and every cold start begins with a closed circuit. It still protects you within a warm instance that handles many requests in a row. To share state across instances, you'd store the circuit state in Redis, which is worth it for high-traffic services.
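One way to prepare for shared state is to put the counters behind a small store interface. The sketch below makes several assumptions: `BreakerStore`, `BreakerRecord`, `MemoryStore`, and `callWithSharedBreaker` are hypothetical names, and the store here is an in-memory `Map` so the example runs on its own. A Redis-backed implementation of the same interface is left out rather than guessed at.

```typescript
// Hypothetical interface: in production this could be backed by Redis
// so that every instance sees the same breaker state.
export interface BreakerRecord {
  failures: number;
  openUntil: number; // epoch ms; 0 means the circuit is closed
}

export interface BreakerStore {
  get(key: string): Promise<BreakerRecord | undefined>;
  set(key: string, record: BreakerRecord): Promise<void>;
}

// In-memory stand-in so the sketch runs without external services.
export class MemoryStore implements BreakerStore {
  private records = new Map<string, BreakerRecord>();
  async get(key: string) {
    return this.records.get(key);
  }
  async set(key: string, record: BreakerRecord) {
    this.records.set(key, record);
  }
}

export async function callWithSharedBreaker<T>(
  store: BreakerStore,
  key: string,
  fn: () => Promise<T>,
  fallback: T,
  opts = { failureThreshold: 3, resetTimeoutMs: 60000 },
  now: () => number = Date.now
): Promise<T> {
  const record = (await store.get(key)) ?? { failures: 0, openUntil: 0 };

  // Circuit is open and still cooling down: skip the call entirely.
  if (now() < record.openUntil) return fallback;

  let result: T;
  try {
    result = await fn();
  } catch {
    // Count the failure and open the circuit once the threshold is reached.
    const failures = record.failures + 1;
    const openUntil =
      failures >= opts.failureThreshold ? now() + opts.resetTimeoutMs : 0;
    await store.set(key, { failures, openUntil });
    return fallback;
  }
  // Success closes the circuit and resets the count.
  await store.set(key, { failures: 0, openUntil: 0 });
  return result;
}
```

The read-then-write here is not atomic, so two instances can race on the failure count. A real shared version would want atomic increments and would also need to survive the store itself being unavailable.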

Pattern 5: Degrade the UI, not just the data

Graceful degradation isn't just about keeping data flowing. It's about what users see and experience. The hierarchy should be: full experience → reduced experience → empty state → error state. Most apps jump straight from full to error. You can do better.

// components/user-avatar.tsx
// A component that gracefully handles multiple failure modes
'use client';

import { useState } from 'react';
import Image from 'next/image';

interface UserAvatarProps {
  user: {
    avatarUrl?: string | null;
    name: string;
    email: string;
  };
  size?: 'sm' | 'md' | 'lg';
}

// Tailwind size classes and the matching pixel dimensions for next/image
const SIZES = {
  sm: { className: 'h-8 w-8', px: 32 },
  md: { className: 'h-10 w-10', px: 40 },
  lg: { className: 'h-12 w-12', px: 48 },
} as const;

export function UserAvatar({ user, size = 'md' }: UserAvatarProps) {
  const { className: sizeClass, px } = SIZES[size];
  // Set when the image fails to load, so the next render falls through to initials
  const [imageFailed, setImageFailed] = useState(false);

  // Tier 1: Real avatar image
  if (user.avatarUrl && !imageFailed) {
    return (
      <Image
        src={user.avatarUrl}
        alt={user.name}
        width={px}
        height={px}
        className={`${sizeClass} rounded-full object-cover`}
        // If the image fails to load, re-render with the Tier 2 fallback
        onError={() => setImageFailed(true)}
      />
    );
  }

  // Tier 2: Initials avatar (no external dependency)
  const initials = user.name
    .split(' ')
    .map((n) => n[0])
    .slice(0, 2)
    .join('')
    .toUpperCase();

  return (
    <div
      role="img"
      className={`${sizeClass} rounded-full bg-muted flex items-center justify-center text-sm font-medium`}
      aria-label={user.name}
    >
      {initials || user.email[0]?.toUpperCase()}
    </div>
  );
  // Tier 3 would be a generic person icon if even name/email is missing
}

This pattern applies everywhere. Search without AI ranking still shows results. A recommendations widget that can't load shows your most popular items instead. A dashboard without live data shows yesterday's data with a 'last updated' timestamp. Each tier is better than falling straight to an error.
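On the data side, the tier hierarchy can be written down directly. Here's a rough sketch, assuming a hypothetical `firstAvailable` helper and `Tier` type of our own invention: it tries each source in order and reports which one answered.

```typescript
// Hypothetical helper: try each tier in order and report which one answered.
export interface Tier<T> {
  name: string;
  load: () => Promise<T>;
}

export interface TierResult<T> {
  tier: string;
  value: T;
}

export async function firstAvailable<T>(
  tiers: Tier<T>[],
  lastResort: T
): Promise<TierResult<T>> {
  for (const tier of tiers) {
    try {
      return { tier: tier.name, value: await tier.load() };
    } catch {
      // This tier failed; fall through to the next one.
    }
  }
  // Every tier failed: return the empty state rather than throwing.
  return { tier: 'empty', value: lastResort };
}
```

For the recommendations example, the tiers would be something like personalized results first, then most popular items, with an empty list as the last resort. Returning the tier name lets the UI label degraded results and lets you count how often each tier is hit.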

The operational side: knowing when to degrade

All the patterns above are reactive — they kick in when something fails. You also want proactive degradation. If you know a service is flaky, add feature flags that let you disable it immediately without a deploy. We've been saved more than once by having a flag that turns off a third-party integration without touching code.
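A kill switch like that doesn't need much code. The sketch below assumes flags arrive as a plain object, which is our simplification: `Flags`, `FallbackReason`, and `callIfEnabled` are hypothetical names, and how you load the flags (a flag service, a database row, a config endpoint) is up to you. For the switch to work without a deploy, the flags have to be read at request time from somewhere outside the build.

```typescript
// Hypothetical kill-switch wrapper; flag loading is deliberately left out.
export type Flags = Record<string, boolean | undefined>;
export type FallbackReason = 'disabled' | 'error';

export async function callIfEnabled<T>(
  flags: Flags,
  flagName: string,
  fn: () => Promise<T>,
  fallback: T,
  onFallback: (reason: FallbackReason) => void = () => {}
): Promise<T> {
  // A missing flag counts as enabled, so only an explicit `false` disables the call.
  if (flags[flagName] === false) {
    onFallback('disabled');
    return fallback;
  }
  try {
    return await fn();
  } catch {
    onFallback('error');
    return fallback;
  }
}
```

The optional `onFallback` callback is a hook for counting how often the fallback path is taken and why, which is the number you'd want to alert on.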

Feature flags for each external dependency — flip them off during incidents
Health checks that test your critical dependencies, not just your server
Alerts when fallback paths are being hit at unusual rates (that's a signal)
Runbooks that document what degrades when each service is unavailable
Testing your fallbacks — literally turn off your external service in staging and verify users still get something

That last one is underrated. We've had fallbacks that were broken for weeks because nobody ever tested them — they only get exercised when the real service is down, which is exactly when you don't want to discover they don't work.

Test your fallbacks. Untested fallbacks are just broken code that hasn't had its moment yet.
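Exercising a fallback gets much easier when the dependency is passed in, so a test can swap in a broken one. Here's a minimal sketch of that idea. Everything in it is an assumption made for illustration: the `Enrichment` shape mirrors the fallback object used earlier, and `getEnrichment` and `runFallbackChecks` are hypothetical. In a real project these checks would be test cases in whatever runner you already use.

```typescript
// Hypothetical shape, mirroring the fallback object used earlier in the article.
interface Enrichment {
  plan: string;
  company: string | null;
  role: string | null;
}

const FALLBACK: Enrichment = { plan: 'unknown', company: null, role: null };

type Fetcher = (email: string) => Promise<Enrichment>;

// Code under test: the fetcher is a parameter so tests can inject a failing one.
export async function getEnrichment(
  email: string,
  fetcher: Fetcher
): Promise<Enrichment> {
  try {
    return await fetcher(email);
  } catch {
    return FALLBACK;
  }
}

// Simulated dependencies: one that is down, one that is healthy.
const serviceDown: Fetcher = async () => {
  throw new Error('503 Service Unavailable');
};
const serviceUp: Fetcher = async () => ({
  plan: 'pro',
  company: 'Acme',
  role: 'cto',
});

// Returns a list of failed checks; an empty list means the fallback path works.
export async function runFallbackChecks(): Promise<string[]> {
  const problems: string[] = [];

  const degraded = await getEnrichment('a@example.com', serviceDown);
  if (degraded.plan !== 'unknown') {
    problems.push('a failing service should yield the fallback');
  }

  const healthy = await getEnrichment('a@example.com', serviceUp);
  if (healthy.plan !== 'pro') {
    problems.push('a healthy service should pass real data through');
  }

  return problems;
}
```

This covers the rejection case only. A hung service needs a fetcher that never settles, combined with whatever timeout wrapper the real code uses.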

The peal.dev templates include this kind of resilience thinking from the start — wrapping third-party calls with sensible timeouts, error boundaries at feature level, and clear patterns for adding fallbacks without retrofitting your entire codebase later.

Where to start if your app is currently brittle

Don't try to add graceful degradation everywhere at once. You'll get lost and ship nothing. Here's the order we'd actually recommend:

First: Add timeouts to every external HTTP call. This alone prevents the worst hangs.
Second: Add error boundaries around your major dashboard widgets.
Third: Identify your most unreliable dependency and add a proper fallback for it.
Fourth: Add a feature flag for that same dependency so you can disable it in 30 seconds.
Fifth: Write a test that simulates the failure and verify users see something useful.

Start with the parts of your app that users use most. A broken settings page is annoying. A broken checkout or onboarding flow costs you money. Prioritize based on impact, not on what's easiest to fix.

The goal isn't a perfectly resilient app on day one. It's building the habit of asking 'what happens when this fails?' every time you add a new external dependency. That question, asked consistently, will save you from a lot of 2am incidents where you're frantically reverting code from a gas station parking lot because your phone was the only device with signal. That one's a true story, and we'd rather not repeat it.