Streaming AI Responses in Next.js: Real-Time UX with Server-Sent Events

There's a specific kind of bad UX that became extremely common the moment everyone started bolting AI features onto their apps: the user clicks a button, a spinner appears, and then 8 seconds later a wall of text just... materializes. No feedback. No progress. Just waiting and then suddenly: everything at once. It feels broken even when it isn't.

Streaming fixes this. Instead of waiting for the entire response before sending anything to the client, you start pushing tokens as they arrive from the LLM. The user sees text appearing in real time, which makes 5 seconds feel like 1 second because there's something happening on screen. We've shipped this pattern in several projects now, and the difference in perceived performance is significant enough that it's basically mandatory if you're building anything AI-powered.

Two Ways to Stream: SSE vs WebSockets

Before we write any code, let's settle this quickly. WebSockets are bidirectional — the server and client can both send messages at any time. Server-Sent Events (SSE) are one-directional — server pushes to client only. For AI response streaming, you don't need bidirectional communication after the initial request. The user sends a message (HTTP POST), and then the server streams back the response. SSE is the right tool here.

SSE also has some practical advantages: it works over regular HTTP/1.1, the browser's `EventSource` client reconnects automatically if the connection drops, and you don't need a special server setup. WebSockets require a protocol upgrade and a long-lived two-way connection, which can be finicky in certain deployment environments. We wrote a whole post about WebSockets in Next.js and the TLDR is: they're more work than they look. For AI streaming, stick with SSE.

The Basic Setup: Route Handler That Streams

In the Next.js App Router, you can return a ReadableStream from a Route Handler. The browser natively understands this as a streaming response. Here's the minimal version using OpenAI directly:

// app/api/chat/route.ts
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();

  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content ?? '';
        if (text) {
          // SSE format: "data: <content>\n\n"
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
        }
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    },
  });
}

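The `text/event-stream` content type tells the browser this is an SSE stream. The `Cache-Control: no-cache` header matters because, without it, some proxies and CDN edges may cache or hold back the response instead of passing it along as it arrives, which completely defeats the purpose. If a reverse proxy still buffers, that usually has to be switched off at the proxy itself (nginx, for example, honors an `X-Accel-Buffering: no` response header). `Connection: keep-alive` asks HTTP/1.1 intermediaries to keep the connection open for the duration of the stream; it has no meaning under HTTP/2, so treat it as harmless rather than essential.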

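The SSE format is strict: each message is one or more `data: <content>` lines followed by a blank line, so a single-line message is `data: <content>\n\n` (two newlines at the end). One newline ends a line within the current message. Two newlines signal the end of the message. Get this wrong and an `EventSource` client won't fire events, and a hand-written parser will split messages in the wrong places.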
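To make the framing rule concrete, here is a minimal sketch of two formatting helpers. The names `formatSseJson` and `formatSseText` are hypothetical and not part of any library. The sketch also shows why the route handler JSON-encodes each token: a token can contain a newline, and a raw newline inside a `data:` line would break the message.

```typescript
// Hypothetical helper: serialize any JSON-compatible payload as one SSE message.
// JSON.stringify escapes newlines, so the payload always fits on a single line.
function formatSseJson(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical helper: send raw text instead of JSON.
// Every line of the text needs its own "data: " prefix, and the message is
// closed by one extra blank line.
function formatSseText(text: string): string {
  const lines = text.split('\n').map((line) => `data: ${line}\n`);
  return lines.join('') + '\n';
}
```

If you send raw text this way, the receiving side has to join the `data:` lines of one message back together with newlines. Sending JSON avoids that step, which is one reason to prefer it.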

Reading the Stream on the Client

You can use the native `EventSource` API, but it only supports GET requests and doesn't let you send a body. Since we need to POST the conversation history, we use `fetch` with a readable stream instead:

// hooks/useChat.ts
import { useState, useCallback } from 'react';

export function useChat() {
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = useCallback(async (messages: Array<{ role: string; content: string }>) => {
    setResponse('');
    setIsStreaming(true);

    try {
      const res = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages }),
      });

      if (!res.ok) throw new Error(`HTTP error: ${res.status}`);
      if (!res.body) throw new Error('No response body');

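      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = ''; // Holds a partial line until the rest of it arrives

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // A network chunk can end mid-line, so only process complete lines
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? '';

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;
          const data = line.slice(6); // Remove "data: " prefix
          if (data === '[DONE]') continue;

          try {
            const parsed = JSON.parse(data);
            // Only append real text; other event shapes are handled elsewhere
            if (typeof parsed.text === 'string') {
              setResponse(prev => prev + parsed.text);
            }
          } catch {
            // Ignore malformed chunks
          }
        }
      }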
    } finally {
      setIsStreaming(false);
    }
  }, []);

  return { response, isStreaming, sendMessage };
}

This hook gives you a `response` string that grows in real time as tokens arrive, an `isStreaming` boolean for showing a cursor or disabling inputs, and a `sendMessage` function. Wire it up to a component and you've got streaming AI responses.
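One detail worth isolating: network chunks are not guaranteed to line up with SSE message boundaries, so a single `data:` line can arrive split across two reads. Whatever reads the stream has to hold on to the unfinished tail until the rest shows up. The sketch below pulls that buffering into a standalone function so it can be unit tested without a network. `createSseLineParser` is a hypothetical name, and the sketch assumes single-line `data: ` payloads like the ones the route handler above emits.

```typescript
// Hypothetical helper: feed it decoded text chunks, get back the payload of
// every complete "data: " line received so far.
function createSseLineParser(): (chunk: string) => string[] {
  let buffer = '';

  return function feed(chunk: string): string[] {
    buffer += chunk;
    const lines = buffer.split('\n');
    // The last element is '' if the chunk ended on a newline, otherwise it is
    // a partial line that must wait for the next chunk.
    buffer = lines.pop() ?? '';

    return lines
      .filter((line) => line.startsWith('data: '))
      .map((line) => line.slice('data: '.length));
  };
}
```

In the hook, you would create one parser per request and call it with each decoded chunk, then JSON-parse the returned payloads.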

Just Use the Vercel AI SDK (Seriously)

Everything above is worth understanding. But in practice, we use the Vercel AI SDK for this. It handles the streaming protocol, error recovery, message history management, and gives you a `useChat` hook that's better than the one we just wrote. The code you'd have to maintain drops significantly.

// app/api/chat/route.ts — with Vercel AI SDK
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o'),
    messages,
    system: 'You are a helpful assistant.',
  });

  return result.toDataStreamResponse();
}

// components/Chat.tsx — useChat from Vercel AI SDK
'use client';
import { useChat } from 'ai/react';

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat();

  return (
    <div>
      <div className="messages">
        {messages.map(m => (
          <div key={m.id} className={m.role}>
            {m.content}
          </div>
        ))}
      </div>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Say something..."
          disabled={isLoading}
        />
        <button type="submit" disabled={isLoading}>
          {isLoading ? 'Thinking...' : 'Send'}
        </button>
      </form>
    </div>
  );
}

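That's a fully functional streaming chat interface. The SDK handles the protocol details, and `toDataStreamResponse()` returns a streaming response in the wire format its `useChat` hook expects. The exact names are version-dependent (the hook's import path and the response helper have both been renamed across major releases), so check them against the SDK version you install. It also supports tool calls, which is where things get really interesting, but that's a post of its own.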

Handling Errors Without Breaking the Stream

One thing that will absolutely happen: the LLM API will time out, rate-limit you, or return a 500 mid-stream. If you're not handling this, the user just sees the text stop mid-sentence with no explanation. Brutal.

The trick is to catch errors inside the stream itself and send a structured error event before closing:

// Inside your ReadableStream start function
async start(controller) {
  try {
    for await (const chunk of stream) {
      const text = chunk.choices[0]?.delta?.content ?? '';
      if (text) {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
      }
    }
    controller.enqueue(encoder.encode('data: [DONE]\n\n'));
  } catch (error) {
    // Send an error event so the client knows what happened
    const errorMessage = error instanceof Error ? error.message : 'Stream failed';
    controller.enqueue(
      encoder.encode(`data: ${JSON.stringify({ error: errorMessage })}\n\n`)
    );
  } finally {
    controller.close();
  }
}

On the client, check for `parsed.error` in your chunk handler and surface it to the user. Even a simple "Something went wrong, please try again" is infinitely better than silent failure.
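As a rough sketch of that client-side check, the per-message handling can be written as a pure function that takes the current state and one `data:` payload and returns the next state. The names `ChatStreamState` and `applyStreamData` are hypothetical, and the event shapes (`{ text }`, `{ error }`, `[DONE]`) are the ones used by the route handler in this post.

```typescript
type ChatStreamState = { text: string; error: string | null; done: boolean };

const initialStreamState: ChatStreamState = { text: '', error: null, done: false };

// Hypothetical reducer: folds one SSE payload into the chat state.
function applyStreamData(state: ChatStreamState, data: string): ChatStreamState {
  if (data === '[DONE]') return { ...state, done: true };

  let parsed: unknown;
  try {
    parsed = JSON.parse(data);
  } catch {
    return state; // Malformed payload: keep what we have
  }
  if (typeof parsed !== 'object' || parsed === null) return state;

  const event = parsed as { text?: unknown; error?: unknown };
  if (typeof event.error === 'string') {
    // Keep the partial text so the user can still read what arrived
    return { ...state, error: event.error, done: true };
  }
  if (typeof event.text === 'string') {
    return { ...state, text: state.text + event.text };
  }
  return state;
}
```

Keeping this logic out of the React hook means the error path can be tested with plain strings, and the component only has to decide how to render `state.error`.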

Abort Controllers: Let Users Stop Generation

If you've ever used ChatGPT, you know the stop button. It's actually important UX — sometimes the model goes off in the wrong direction and you want to cut it off. This is where `AbortController` comes in:

// In your hook (add useRef to the React import)
const abortControllerRef = useRef<AbortController | null>(null);

const sendMessage = useCallback(async (messages: Array<{ role: string; content: string }>) => {
  // Cancel any existing stream
  abortControllerRef.current?.abort();
  abortControllerRef.current = new AbortController();

  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
    signal: abortControllerRef.current.signal, // Pass signal to fetch
  });

  // ... rest of streaming logic
}, []);

const stopGeneration = useCallback(() => {
  abortControllerRef.current?.abort();
  setIsStreaming(false);
}, []);

// In your Route Handler, forward the signal to OpenAI
export async function POST(req: Request) {
  const { messages } = await req.json();
  
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  }, {
    signal: req.signal, // Forward abort signal from the request
  });
  // ...
}

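Forwarding `req.signal` to the OpenAI client is important: it means when the user aborts, we also cancel the upstream API call. Otherwise you're paying for tokens that no one will ever see, which adds up. It's worth confirming in your own deployment that `req.signal` actually fires when the client disconnects, since that depends on the runtime and hosting platform.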
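One client-side consequence of aborting: the pending `fetch` call (or the pending `reader.read()`) rejects with an error whose `name` is `AbortError`. That is a deliberate cancellation, not a failure, so it usually shouldn't be shown to the user. A minimal sketch of telling the two apart follows; `isAbortError` and `userFacingError` are hypothetical helper names, and the fallback message is just the example wording used earlier in this post.

```typescript
// Hypothetical helper: true when the thrown value came from an aborted request.
function isAbortError(error: unknown): boolean {
  return (
    typeof error === 'object' &&
    error !== null &&
    'name' in error &&
    (error as { name: unknown }).name === 'AbortError'
  );
}

// Hypothetical helper: what to show the user for a thrown value.
// Returns null when the user cancelled on purpose and nothing should be shown.
function userFacingError(error: unknown): string | null {
  if (isAbortError(error)) return null;
  return 'Something went wrong, please try again.';
}
```

In the hook, this would sit in a `catch` block around the fetch and read loop, with the existing `finally` still resetting `isStreaming`.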

A Few Things That Will Catch You Out

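Timeouts: Vercel Edge Functions have a 25-second limit, which as far as we know applies to how long the function can take to start sending a response. Slow or long AI requests can run into it. Use the Node.js runtime (`export const runtime = 'nodejs'`) for AI routes: the timeout is 60s on Pro, or configure `maxDuration`. These limits vary by plan and change over time, so check the current Vercel docs.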
Buffering middleware: If you have any middleware that reads and rewrites the response body, it'll buffer your entire stream. Be careful what you put in `middleware.ts` for AI routes.
Markdown rendering: If you're rendering the streamed text as Markdown, naive implementations re-render the entire string on every token. Use a streaming-aware Markdown renderer or debounce the updates.
Mobile connections: Streams can drop on flaky mobile networks. Building retry logic into your hook is worth it for production — track the last token position and resume from there if you can.
Rate limits: OpenAI limits tokens per minute as well as requests per minute. If multiple users are streaming simultaneously, you'll hit the token limit faster than you expect. Build in exponential backoff.
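For the exponential backoff mentioned in the last item, a minimal sketch of the delay calculation is below. The function name `backoffDelayMs` and the default values (500ms base, 30s cap) are assumptions chosen for illustration, not recommendations from OpenAI. The random jitter is an addition on top of plain exponential backoff, so that many clients retrying at once don't all hit the API at the same instant.

```typescript
// Hypothetical helper: how long to wait before retry number `attempt`
// (0 for the first retry). The random source is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  maxMs: number = 30000,
  random: () => number = Math.random
): number {
  // Double the ceiling on every attempt, but never exceed maxMs
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Pick a delay uniformly between 0 and the ceiling
  return Math.floor(random() * ceiling);
}
```

A retry loop would sleep for `backoffDelayMs(attempt)` after a rate-limit response and give up after a handful of attempts. It is simplest to retry only before any tokens have been streamed to the client, since retrying mid-stream would repeat text the user has already seen.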

The Markdown rendering one bit us on a real project. We were using `react-markdown` and re-parsing the entire accumulated string on every single token update. On long responses, the UI started stuttering noticeably around 500 tokens. The fix was to accumulate the raw string in a ref and only update the rendered state at most once every 50ms. Strictly speaking that's a throttle rather than a debounce, since a true debounce would wait until tokens stop arriving.
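The "render at most every 50ms" idea can be sketched as a small time-gated buffer. This is an illustration under assumptions rather than the code from that project: `ThrottledText` is a hypothetical class, and the current time is passed in explicitly so the behavior can be tested without real timers.

```typescript
// Hypothetical helper: collects tokens and says when a re-render is due.
class ThrottledText {
  private text = '';
  private lastEmitMs = Number.NEGATIVE_INFINITY;
  private intervalMs: number;

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs;
  }

  // Append a token. Returns the full text if enough time has passed since the
  // last render, otherwise null (meaning: skip this render).
  push(token: string, nowMs: number): string | null {
    this.text += token;
    if (nowMs - this.lastEmitMs < this.intervalMs) return null;
    this.lastEmitMs = nowMs;
    return this.text;
  }

  // Call once when the stream ends so the last few tokens are rendered.
  flush(): string {
    return this.text;
  }
}
```

In a hook, you would keep one instance in a ref, call `push(token, performance.now())` for each token, and call `setResponse` only when the result is not null. The final `flush()` matters: without it, tokens that arrived inside the last interval would never reach the screen.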

If you're building an AI-heavy product, streaming isn't a nice-to-have. It's the difference between your app feeling responsive and feeling broken. Ship it from day one.

Starting From a Template

If you're building an AI-powered SaaS and don't want to wire up streaming, auth, payments, and database from scratch every time, our templates at peal.dev include the AI chat patterns covered here as a starting point — so you can focus on the actual product instead of plumbing.

Streaming AI responses is one of those things that sounds complex but has a very clean implementation once you understand the moving parts: SSE format on the server, ReadableStream reader on the client, error events for failures, AbortController for cancellation. The Vercel AI SDK handles most of this for you if you want to move fast. But understanding what's happening underneath means you can debug it when something goes wrong at 2am — and it will.