Rate Limiting in Next.js — Protecting Your API Without Overcomplicating It

We shipped our first SaaS without any rate limiting. The wake-up call was a Sunday afternoon when someone hammered our AI endpoint 3,000 times in about four minutes — presumably testing limits, possibly a competitor, definitely a jerk. Our OpenAI bill for that day was... educational. We added rate limiting that same evening, cursing ourselves for not doing it earlier.

Rate limiting isn't glamorous, but it's one of those things that separates a hobby project from something you can actually run in production. And the good news: it doesn't have to be complicated. Most apps don't need a distributed rate limiting cluster. They need something that works, doesn't add latency, and takes an hour to set up — not a week.

What You're Actually Protecting Against

Before picking a solution, be clear about your threat model. Rate limiting protects against different things, and mixing them up leads to over-engineering.

Accidental abuse — a bug in someone's frontend that fires the same request in a loop
Scraping — bots hitting your public endpoints to steal data
Credential stuffing — automated login attempts against your auth endpoints
Cost bombs — someone triggering expensive operations (AI calls, email sends, file processing) repeatedly
DoS — deliberate attempts to take your service down

For most indie apps and small SaaS products, you're mainly worried about the first four. Actual DoS attacks at scale require infrastructure-level protection (Cloudflare, your hosting provider's DDoS mitigation) that no amount of application-level rate limiting will fix. Focus on protecting your expensive operations and your auth endpoints first.

The Simplest Approach: In-Memory Rate Limiting

If you're running a single server, or if you're on Vercel but okay with per-instance limits, an in-memory solution is perfectly fine to start. The `lru-cache` package is all you need. The official Next.js examples used this same pattern in their `api-routes-rate-limit` example.

// lib/rate-limit.ts
import { LRUCache } from 'lru-cache';

type Options = {
  uniqueTokenPerInterval?: number;
  interval?: number; // in milliseconds
};

export function rateLimit(options?: Options) {
  const tokenCache = new LRUCache<string, number[]>({
    max: options?.uniqueTokenPerInterval ?? 500,
    ttl: options?.interval ?? 60_000,
  });

  return {
    check: (limit: number, token: string) => {
      const tokenCount = tokenCache.get(token) ?? [];
      const now = Date.now();
      const windowStart = now - (options?.interval ?? 60_000);

      // Filter out timestamps outside the current window
      const requestsInWindow = tokenCount.filter((ts) => ts > windowStart);

      if (requestsInWindow.length >= limit) {
        return { success: false, remaining: 0 };
      }

      requestsInWindow.push(now);
      tokenCache.set(token, requestsInWindow);

      return {
        success: true,
        remaining: limit - requestsInWindow.length,
      };
    },
  };
}

Then use it in your Route Handler:

// app/api/generate/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { rateLimit } from '@/lib/rate-limit';

const limiter = rateLimit({
  interval: 60 * 1000, // 1 minute
  uniqueTokenPerInterval: 500,
});

export async function POST(request: NextRequest) {
  // Use IP as the token, or user ID if authenticated.
  // x-forwarded-for can be a comma-separated chain; take the first entry.
  const ip = request.headers.get('x-forwarded-for')?.split(',')[0]?.trim() || 'anonymous';
  const { success, remaining } = limiter.check(10, ip); // 10 requests per minute

  if (!success) {
    return NextResponse.json(
      { error: 'Rate limit exceeded. Try again in a minute.' },
      {
        status: 429,
        headers: {
          'X-RateLimit-Remaining': '0',
          'Retry-After': '60',
        },
      }
    );
  }

  // Your actual handler logic
  const result = await doExpensiveAIThing(await request.json());

  return NextResponse.json(result, {
    headers: { 'X-RateLimit-Remaining': remaining.toString() },
  });
}

The LRU cache approach works great on a single server. On Vercel or any serverless platform, each function instance gets its own memory, so limits aren't shared across instances: with N warm instances a client can get up to N times the configured limit, and counters vanish whenever an instance is recycled. For most apps at most traffic levels, this is actually fine. Don't let perfect be the enemy of good.

When You Need Redis: Upstash Is the Answer

If you're on serverless and you need limits enforced globally across all instances — think login endpoints, payment triggers, anything where per-instance limits create security holes — you need a shared store. Redis is the standard choice, and Upstash is the serverless-friendly way to run it without paying for a dedicated Redis instance.

Upstash has a package called `@upstash/ratelimit` that handles all the sliding window math for you. Pair it with their `@upstash/redis` package and you're done in 20 lines.

// lib/rate-limit-redis.ts
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

// Create reusable instances — don't init these inside request handlers
export const authLimiter = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(5, '1 m'), // 5 attempts per minute
  analytics: true,
  prefix: 'ratelimit:auth',
});

export const apiLimiter = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(60, '1 m'), // 60 requests per minute
  analytics: true,
  prefix: 'ratelimit:api',
});

export const aiLimiter = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.fixedWindow(10, '1 h'), // 10 AI calls per hour
  analytics: true,
  prefix: 'ratelimit:ai',
});

Then apply the auth limiter in your login Route Handler:

// app/api/auth/login/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { authLimiter } from '@/lib/rate-limit-redis';

export async function POST(request: NextRequest) {
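  // x-forwarded-for can be a comma-separated chain; take the first entry
  const ip = request.headers.get('x-forwarded-for')?.split(',')[0]?.trim() || 'unknown';

  // For auth endpoints, also rate limit by email to prevent
  // distributed attacks from multiple IPs against one account.
  // The two keys are checked separately: a combined `${ip}:${email}` key
  // would hand every new IP a fresh bucket for the same account.
  const body = await request.json();
  const email = body.email?.toLowerCase() ?? 'unknown';

  const [byIp, byEmail] = await Promise.all([
    authLimiter.limit(`ip:${ip}`),
    authLimiter.limit(`email:${email}`),
  ]);
  // Report whichever limit was hit; both must pass to proceed
  const { success, limit, remaining, reset } = byIp.success ? byEmail : byIp;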

  if (!success) {
    // Round up and never send 0, so clients don't retry immediately
    const retryAfter = Math.max(1, Math.ceil((reset - Date.now()) / 1000));
    return NextResponse.json(
      { error: 'Too many login attempts. Please wait before trying again.' },
      {
        status: 429,
        headers: {
          'X-RateLimit-Limit': limit.toString(),
          'X-RateLimit-Remaining': remaining.toString(),
          'Retry-After': retryAfter.toString(),
        },
      }
    );
  }

  // Proceed with login logic
  return handleLogin(body);
}

The `slidingWindow` algorithm is usually what you want: it smooths out request distribution so users can't exploit fixed window resets by sending bursts right at the boundary. `fixedWindow` is simpler and slightly cheaper (fewer Redis operations), so it's a reasonable choice for non-critical endpoints.
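
To make the boundary problem concrete, here is a small self-contained simulation. It doesn't touch Upstash at all: `countAllowedFixed` and `countAllowedSliding` are hypothetical helpers that replay a list of request timestamps against each strategy. The sliding version is an exact sliding log, which is simpler than the weighted approximation a Redis-backed limiter may use.

```typescript
// Replay request timestamps (ms) against a fixed window: counters reset at
// every multiple of windowMs.
function countAllowedFixed(timestamps: number[], limit: number, windowMs: number): number {
  const counts = new Map<number, number>();
  let allowed = 0;
  for (const ts of timestamps) {
    const bucket = Math.floor(ts / windowMs);
    const used = counts.get(bucket) ?? 0;
    if (used < limit) {
      counts.set(bucket, used + 1);
      allowed++;
    }
  }
  return allowed;
}

// Replay the same timestamps against a sliding log: a request is allowed only
// if fewer than `limit` requests were accepted in the previous windowMs.
function countAllowedSliding(timestamps: number[], limit: number, windowMs: number): number {
  const accepted: number[] = [];
  for (const ts of timestamps) {
    const recent = accepted.filter((t) => t > ts - windowMs).length;
    if (recent < limit) accepted.push(ts);
  }
  return accepted.length;
}

// 10 requests just before the one-minute boundary, 10 more just after it.
const burst = [
  ...Array.from({ length: 10 }, (_, i) => 59_000 + i),
  ...Array.from({ length: 10 }, (_, i) => 60_000 + i),
];

const fixedAllowed = countAllowedFixed(burst, 10, 60_000);     // 20: both bursts pass
const slidingAllowed = countAllowedSliding(burst, 10, 60_000); // 10: second burst is rejected
console.log({ fixedAllowed, slidingAllowed });
```

With a limit of 10 per minute, the fixed window lets 20 requests through in about a second because the counter resets at the boundary. The sliding log still sees the first burst and rejects the second.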

Rate Limiting at the Middleware Level

Route-level rate limiting is fine, but you end up duplicating the check logic everywhere. A cleaner pattern is doing it in `middleware.ts`, which runs before your route handlers and lets you protect whole groups of routes at once.

// middleware.ts
import { NextRequest, NextResponse } from 'next/server';
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(30, '1 m'),
  prefix: 'ratelimit:middleware',
});

export async function middleware(request: NextRequest) {
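  // Only rate limit API routes, and never incoming webhooks
  // (the /api/webhooks/ prefix is an example; use your own webhook paths)
  const { pathname } = request.nextUrl;
  if (!pathname.startsWith('/api/') || pathname.startsWith('/api/webhooks/')) {
    return NextResponse.next();
  }

  // x-forwarded-for can be a comma-separated chain; take the first entry
  const ip = request.headers.get('x-forwarded-for')?.split(',')[0]?.trim() || 'unknown';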
  const { success, remaining, reset } = await ratelimit.limit(ip);

  if (!success) {
    return new NextResponse(
      JSON.stringify({ error: 'Rate limit exceeded' }),
      {
        status: 429,
        headers: {
          'Content-Type': 'application/json',
          // Round up and never send 0, so clients don't retry immediately
          'Retry-After': Math.max(1, Math.ceil((reset - Date.now()) / 1000)).toString(),
        },
      }
    );
  }

  const response = NextResponse.next();
  response.headers.set('X-RateLimit-Remaining', remaining.toString());
  return response;
}

export const config = {
  matcher: '/api/:path*',
};

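One thing to watch: middleware runs on the Edge runtime by default, which means you can't use Node.js APIs there. Upstash works fine because it's HTTP-based under the hood. Standard Redis clients like `ioredis` need raw TCP sockets and will break. Recent Next.js releases can run middleware on the Node.js runtime instead, so check what your version supports before choosing your approach.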

Identifying Users Properly

Using IP addresses is a decent starting point but has real limitations. Carrier-grade NAT, corporate networks, and VPNs mean many real users can share one IP, while IPv6 ranges and proxy pools let one bad actor rotate through many. Here's a tiered approach that's more robust:

Unauthenticated endpoints: use IP address, accept it's imperfect
Authenticated endpoints: always use user ID — much more reliable and lets you target specific abusive accounts
Separate IP and email limits for login endpoints: the IP limit catches brute force from one source, the email limit catches distributed attacks against one account
API key endpoints: rate limit by the API key itself, not the IP — users behind proxies shouldn't get penalized
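
The tiers above boil down to "use the most specific identifier you have". A minimal sketch of that choice, assuming a hypothetical `Caller` shape that your auth layer would fill in (none of these names come from a library):

```typescript
// Hypothetical shape: whatever your auth layer knows about the caller.
type Caller = {
  userId?: string;
  apiKey?: string;
  ip?: string;
};

// Pick the most specific identifier available. The prefix keeps a user ID
// from ever colliding with an IP or API key that happens to look the same.
function rateLimitKey(caller: Caller): string {
  if (caller.apiKey) return `key:${caller.apiKey}`;
  if (caller.userId) return `user:${caller.userId}`;
  return `ip:${caller.ip ?? 'unknown'}`;
}

console.log(rateLimitKey({ userId: 'u_123', ip: '203.0.113.7' })); // "user:u_123"
```

The returned string is what you would pass as the token or identifier to whichever limiter you use.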

On Vercel, the real client IP is in `x-forwarded-for`. On other platforms it might be `x-real-ip` or `cf-connecting-ip` if you're behind Cloudflare. Check what your hosting provider actually sets — we got burned by this once when our rate limiting was effectively doing nothing because we were reading the wrong header and getting the load balancer's IP every time.
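
One way to avoid reading the wrong header is to centralize the lookup in a single helper. This is a sketch, not a drop-in: `getClientIp` and the `IP_HEADERS` order are assumptions, and you should trim the list to the headers your platform actually sets, since a client can spoof any header your proxy doesn't overwrite.

```typescript
// Anything with a Headers-like `get` works, including `request.headers`.
type HeaderSource = { get(name: string): string | null };

// Header names in priority order. Which one is trustworthy depends on your
// host, so treat this list as an assumption and keep only what your platform sets.
const IP_HEADERS = ['cf-connecting-ip', 'x-real-ip', 'x-forwarded-for'];

function getClientIp(headers: HeaderSource): string | null {
  for (const name of IP_HEADERS) {
    const value = headers.get(name);
    if (!value) continue;
    // x-forwarded-for may hold "client, proxy1, proxy2"; take the first entry.
    const first = value.split(',')[0]?.trim();
    if (first) return first;
  }
  return null;
}
```

In a Route Handler this would be called as `getClientIp(request.headers)`. Logging the result once in production is a cheap way to confirm you're seeing real client addresses and not your load balancer.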

Different Limits for Different Endpoints

Not all endpoints need the same limits. A public health check endpoint can handle thousands of requests. Your AI generation endpoint should maybe get 10 per hour for free users. Think about cost and sensitivity, not just traffic.

Auth endpoints (login, password reset, magic link): very tight — 5 per minute, 20 per hour
AI/expensive computation endpoints: limit by hour or day, not minute, and tie to user tier
General API endpoints: 60–120 per minute per user is usually fine
Public read endpoints: 200–300 per minute per IP, or don't bother and just use CDN caching
Webhook endpoints: don't rate limit these — your payment provider doesn't like 429s
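
A table like this is easy to keep in one place in code. The sketch below is illustrative only: the paths, the `LIMITS` table, and `limitFor` are made-up names, and the numbers are picked from the ranges above.

```typescript
type Limit = { max: number; windowMs: number } | null; // null = exempt

// Order matters: the first matching prefix wins, so list specific routes
// before the general /api/ rule. Paths and numbers are illustrative.
const LIMITS: Array<[string, Limit]> = [
  ['/api/webhooks/', null],                            // never rate limited
  ['/api/auth/', { max: 5, windowMs: 60_000 }],        // 5 per minute
  ['/api/generate', { max: 10, windowMs: 3_600_000 }], // 10 per hour
  ['/api/', { max: 60, windowMs: 60_000 }],            // 60 per minute
];

function limitFor(pathname: string): Limit {
  for (const [prefix, limit] of LIMITS) {
    if (pathname.startsWith(prefix)) return limit;
  }
  return null; // not an API route: no limit
}
```

A middleware or route helper can call `limitFor(pathname)` and skip the check entirely when it returns `null`.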

Never rate limit your incoming webhook endpoints. Stripe, Resend, and every other provider will retry on failure, but if they keep hitting 429s, they'll eventually give up or flag your endpoint. Webhooks should be exempt from rate limiting and protected differently (by validating the signature).
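
For reference, signature validation usually means recomputing an HMAC over the raw request body and comparing it to what the provider sent. The sketch below assumes a generic scheme (hex-encoded HMAC-SHA256 of the raw body); `sign` and `isValidSignature` are hypothetical helpers. Real providers add timestamps and their own header formats, so prefer their official SDK helper when one exists.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Generic scheme, assumed for illustration: HMAC-SHA256(secret, rawBody).
function sign(rawBody: string, secret: string): Buffer {
  return createHmac('sha256', secret).update(rawBody).digest();
}

// Compare the hex signature from the request with the one we compute.
function isValidSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = sign(rawBody, secret);
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Note that this needs the raw body (for example from `await request.text()`), not the parsed JSON, because re-serializing can change the bytes.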

Returning Good Errors

A 429 response should tell the client what to do next, not just that they failed. The `Retry-After` header is standard HTTP (RFC 9110), and well-behaved clients and retry libraries can honor it. The `X-RateLimit-*` headers are convention but widely understood. Use both.

Your error message should be human-readable too, because developers often hit your API manually and need to understand what happened. "Rate limit exceeded" is okay but "You've made 10 requests in the past minute. Limit is 10 per minute. Try again in 23 seconds." is actually useful.
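
A small helper keeps the message and the headers in sync. This is a sketch under our own naming (`LimitInfo` and `tooManyRequests` are hypothetical); it returns plain data so you can hand it to whatever response API you use.

```typescript
type LimitInfo = {
  limit: number;       // max requests per window
  windowLabel: string; // e.g. "minute" or "hour"
  resetAt: number;     // Unix ms when the client may retry
};

function tooManyRequests(info: LimitInfo, now: number = Date.now()) {
  // Round up and never report 0, so clients don't retry immediately.
  const retryAfter = Math.max(1, Math.ceil((info.resetAt - now) / 1000));
  return {
    status: 429,
    headers: {
      'Retry-After': String(retryAfter),
      'X-RateLimit-Limit': String(info.limit),
      'X-RateLimit-Remaining': '0',
    },
    body: {
      error:
        `Rate limit exceeded: ${info.limit} requests per ${info.windowLabel}. ` +
        `Try again in ${retryAfter} second${retryAfter === 1 ? '' : 's'}.`,
    },
  };
}
```

In a Route Handler that would look like `const r = tooManyRequests(info); return NextResponse.json(r.body, { status: r.status, headers: r.headers });`.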

On the frontend, handle 429s gracefully. If you're using React Query or SWR, configure them to respect `Retry-After` instead of hammering the endpoint on error. A simple toast notification telling the user to slow down is better than silent failures or an infinite loading state.
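
If you aren't using a data-fetching library, the same idea fits in a thin wrapper around `fetch`. A minimal sketch, assuming `Retry-After` is sent as a number of seconds (the HTTP-date form falls back to a default delay); `retryDelayMs` and `fetchWithRetryAfter` are hypothetical names.

```typescript
// Parse Retry-After as delay-seconds; anything else falls back to a default.
function retryDelayMs(retryAfter: string | null, fallbackMs = 1_000): number {
  if (retryAfter === null || retryAfter.trim() === '') return fallbackMs;
  const seconds = Number(retryAfter);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

// Retry a request a bounded number of times, waiting as long as the server asks.
async function fetchWithRetryAfter(
  url: string,
  init?: RequestInit,
  maxRetries = 2,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, init);
    if (response.status !== 429 || attempt >= maxRetries) return response;
    const delay = retryDelayMs(response.headers.get('Retry-After'));
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Keep `maxRetries` small. If the server says to wait an hour, it's usually better to surface the error to the user than to hold the request open.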

Testing Your Rate Limits

This is the part everyone skips and then wonders why their limits don't work. Test them locally before you ship. With the in-memory approach, you can set a very low limit (2 requests per minute) and verify it works. With Upstash, you can use their free tier in development too — just use a separate Redis database for dev vs. prod.

# Quick test with curl — hit the endpoint rapidly and watch the responses
for i in {1..15}; do
  echo "Request $i:"
  curl -s -o /dev/null -w "%{http_code}" -X POST http://localhost:3000/api/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "test"}'
  echo ""
  sleep 0.1
done

# You should see 200s, then 429s after hitting your limit

Run this against your local dev server and confirm you get 429s when expected. Also test that the counter resets after the window passes. It sounds obvious but we've definitely shipped "rate limiting" that never actually triggered because we misconfigured the window size.
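
Waiting a full minute to check the reset gets old quickly. One option is to make the clock injectable so a test can jump forward in time. The sketch below is a stripped-down stand-in for the in-memory limiter (no LRU cache), with `createLimiter` and the fake clock being our own illustrative names.

```typescript
// Minimal sliding-log limiter with an injectable clock, so tests don't have
// to wait for real time to pass.
function createLimiter(limit: number, windowMs: number, now: () => number = Date.now) {
  const hits = new Map<string, number[]>();
  return (key: string): boolean => {
    const current = now();
    const recent = (hits.get(key) ?? []).filter((ts) => ts > current - windowMs);
    const allowed = recent.length < limit;
    if (allowed) recent.push(current);
    hits.set(key, recent);
    return allowed;
  };
}

// Fake clock: advance it by hand instead of sleeping.
let fakeNow = 0;
const check = createLimiter(2, 60_000, () => fakeNow);

const results = [check('a'), check('a'), check('a')]; // third call is over the limit
fakeNow += 60_001;                                     // jump past the window
const afterReset = check('a');
console.log(results, afterReset); // [ true, true, false ] true
```

The same trick works with a test runner's fake timers if your limiter calls `Date.now()` directly.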

What We Actually Use

For the templates we ship at peal.dev, the pattern we've settled on is: in-memory for development and simple single-server deployments, Upstash for anything production-grade on serverless. We have the Upstash setup pre-configured in templates that include auth or AI features, because those are the two places where missing rate limits will definitely bite you.

The setup cost is genuinely low: Upstash's free tier is plenty for early-stage apps (10k requests per day), and you can upgrade when you actually need to. Don't let the Redis conversation make you think this is complex infrastructure: it's two environment variables (`UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN`, which `Redis.fromEnv()` reads) and a package install.

Start with in-memory rate limiting for non-auth endpoints. Add Upstash for auth and expensive operations. Add it to middleware when you're tired of copy-pasting the check. Do all three of these before you think about anything fancier.

The practical takeaway: pick one approach today and ship it. An imperfect rate limit that exists is infinitely better than the perfect rate limiting architecture you're still designing when someone triggers your AI endpoint 3,000 times on a Sunday. We have the OpenAI invoice to prove it.