Building AI Tools Without a PhD — Practical Patterns for Web Devs

Six months ago, Ștefan spent three days trying to understand attention mechanisms before touching a single line of AI code. Three days. Reading papers, watching lectures, drawing diagrams. Then he shipped exactly nothing. What finally got us building real AI features was accepting one truth: you don't need to understand how the engine works to drive the car.

We've since built a document analyzer, a code review assistant, a template recommendation system, and a handful of half-baked experiments that live in a graveyard folder. This post is everything we wish someone had told us before we started — the patterns that actually work, the ones that sound good but don't, and the mistakes that cost us real money.

Start With the Output, Not the Model

The biggest mistake beginners make is starting with "which model should I use?" That's the wrong question. Start with: what does the output need to look like? If you need structured data, you need structured output. If you need a yes/no decision, you need a yes/no response. If you need a 2000-word essay, you need streaming. The model choice follows from the output requirement, not the other way around.

Practically, this means designing your output schema before writing a single prompt. For anything beyond a simple text response, you want JSON. Every time. Here's what a basic structured output call looks like with the OpenAI SDK:

import OpenAI from 'openai';
import { z } from 'zod';
import { zodResponseFormat } from 'openai/helpers/zod';

const client = new OpenAI();

const ReviewSchema = z.object({
  sentiment: z.enum(['positive', 'negative', 'neutral']),
  score: z.number().min(0).max(10),
  issues: z.array(z.string()),
  suggestions: z.array(z.string()),
  summary: z.string().max(200),
});

type Review = z.infer<typeof ReviewSchema>;

async function analyzeCode(code: string): Promise<Review> {
  const response = await client.beta.chat.completions.parse({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'You are a code reviewer. Analyze the provided code and return structured feedback.',
      },
      {
        role: 'user',
        content: `Review this code:\n\n${code}`,
      },
    ],
    response_format: zodResponseFormat(ReviewSchema, 'review'),
  });

  const result = response.choices[0].message.parsed;
  if (!result) throw new Error('Failed to parse response');
  
  return result;
}

Notice we're using `gpt-4o-mini` not `gpt-4o`. For structured extraction tasks, the mini models are 95% as good at 5% of the cost. We learned this after a $40 bill in one afternoon of testing. Use the big models for complex reasoning, use mini for classification, extraction, and summarization.

Streaming Is Not Optional

If your AI feature generates any text longer than a sentence, you need streaming. Not because it's fancy, but because waiting 8 seconds for a response while staring at a spinner will kill your retention. Users will assume it's broken. We've seen this with our own products — the same feature with streaming vs without, streaming wins every usability test.

The good news: Next.js and the Vercel AI SDK make this almost trivially easy. Here's a streaming route handler that you can drop into any Next.js app:

// app/api/generate/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

export const maxDuration = 30;

export async function POST(req: Request) {
  const { prompt, context } = await req.json();

  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: `You are a helpful assistant. Use the provided context to answer questions accurately.
    
Context:
${context}`,
    prompt,
    maxTokens: 1024,
  });

  return result.toDataStreamResponse();
}

// components/GenerateButton.tsx
'use client';

import { useCompletion } from 'ai/react';

export function GenerateButton({ context }: { context: string }) {
  const { completion, complete, isLoading } = useCompletion({
    api: '/api/generate',
    body: { context },
  });

  return (
    <div>
      <button
        onClick={() => complete('Summarize the key points')}
        disabled={isLoading}
      >
        {isLoading ? 'Generating...' : 'Generate Summary'}
      </button>
      {completion && (
        <div className="mt-4 prose">
          {completion}
        </div>
      )}
    </div>
  );
}

The Vercel AI SDK handles all the SSE plumbing. You get real-time token streaming, loading states, and error handling without writing a single line of stream parsing code. If you're not using it, you're doing more work than necessary.

Prompt Engineering Is Just Software Engineering

Stop treating prompts as magic spells you copy from Reddit. Treat them like functions. They have inputs, they have outputs, they have edge cases, and they need tests. The mental model shift from "prompt hacking" to "prompt engineering" is what separates AI features that work reliably from ones that surprise you in production.

Our rules for prompts that hold up:

Put constraints in the system prompt, not the user prompt. The system prompt is your contract with the model.
Be specific about format. Don't say 'be concise' — say 'respond in 3 sentences maximum'.
Give examples for anything non-obvious. One good example beats 200 words of instructions.
Handle the failure case explicitly. Tell the model what to do when it can't answer: 'If you don't know, say exactly: I don't have enough information to answer this.'
Version your prompts. Store them in constants or a config file, not inline strings. You will iterate.
Test with adversarial inputs before you ship. Someone will send something weird.

A prompt that works 90% of the time will embarrass you 10% of the time. That 10% is what your users screenshot and post on Twitter.

RAG Without the Buzzword Overhead

RAG — Retrieval Augmented Generation — sounds complicated. It's not. It's: find relevant content, stuff it into the prompt, ask a question. That's it. The "retrieval" part is where people overcomplicate things.

For most web apps, you don't need a vector database on day one. If you have less than 10,000 documents and each document is under 2000 words, you can get surprisingly far with Postgres full-text search. We ran our first document Q&A feature entirely on `pg_trgm` before we touched any vector tooling. It handled the 80% case perfectly fine.

When you do need vectors, pgvector is your friend. It's a Postgres extension, so you get vectors living right next to your regular data. No separate infrastructure, no new mental model, no new backup strategy. Here's a minimal example:

// lib/embeddings.ts
import OpenAI from 'openai';
import { db } from './db'; // your drizzle/prisma client

const openai = new OpenAI();

export async function embedText(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // cheap and good enough
    input: text.slice(0, 8000), // safety trim
  });
  return response.data[0].embedding;
}

export async function findSimilarChunks(
  query: string,
  limit = 5
): Promise<{ content: string; similarity: number }[]> {
  const queryEmbedding = await embedText(query);
  
  // pgvector cosine similarity search
  const results = await db.execute(
    `SELECT content, 1 - (embedding <=> $1::vector) as similarity
     FROM document_chunks
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), limit]
  );
  
  return results.rows as { content: string; similarity: number }[];
}

export async function answerWithContext(question: string): Promise<string> {
  const chunks = await findSimilarChunks(question);
  const context = chunks
    .filter(c => c.similarity > 0.7) // ignore weak matches
    .map(c => c.content)
    .join('\n\n---\n\n');

  if (!context) {
    return "I don't have enough information to answer that question.";
  }

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `Answer questions using only the provided context. If the context doesn't contain the answer, say so.\n\nContext:\n${context}`,
      },
      { role: 'user', content: question },
    ],
  });

  return response.choices[0].message.content ?? 'No response generated.';
}

The similarity threshold (`0.7` above) is something you tune per use case. Too high and you miss relevant content. Too low and you stuff the prompt with noise. Start at 0.75 and adjust based on what you see.

Costs Will Surprise You — Here's How to Not Get Wrecked

We have a fun story about deploying an AI feature at 2am and waking up to a $200 bill because we forgot to add rate limiting. The feature was working great — users loved it — and they were using it compulsively at 3am. Lesson learned the expensive way.

Cost control patterns that actually work:

Set a maxTokens limit on every completion call. Always. There's no reason to let a response run to 4096 tokens when you need 500.
Rate limit by user, not just by IP. Authenticated users should have a daily or hourly budget. Even generous limits (50 requests/day) prevent abuse.
Log every API call with token counts. You can't optimize what you don't measure. Store model, input tokens, output tokens, and user ID.
Use caching aggressively. If two users ask the same question about your docs, you should not make two API calls. Redis + a hash of the input works fine.
Separate expensive from cheap operations. Use gpt-4o only when the task genuinely needs it. Classification, sentiment, simple extraction — all gpt-4o-mini territory.
Set billing alerts on your OpenAI account. $10, $25, $50 thresholds. You want a text message before you get a surprise.

Error Handling That Doesn't Embarrass You

AI APIs fail in ways that normal APIs don't. Rate limits, context window exceeded, content policy violations, model overloaded, response cut off mid-sentence. Your error handling needs to account for all of these, and you need to communicate meaningfully to users when things go wrong.

import OpenAI from 'openai';

const openai = new OpenAI();

type AIResult<T> =
  | { success: true; data: T }
  | { success: false; error: string; retryable: boolean };

export async function safeCompletion<T>(
  fn: () => Promise<T>
): Promise<AIResult<T>> {
  try {
    const data = await fn();
    return { success: true, data };
  } catch (error) {
    if (error instanceof OpenAI.APIError) {
      // Rate limited — tell the user to wait
      if (error.status === 429) {
        return {
          success: false,
          error: 'Too many requests. Please wait a moment and try again.',
          retryable: true,
        };
      }

      // Content policy violation
      if (error.status === 400 && error.message.includes('content_policy')) {
        return {
          success: false,
          error: "That request can't be processed. Please rephrase your input.",
          retryable: false,
        };
      }

      // Context too long
      if (error.message.includes('context_length_exceeded')) {
        return {
          success: false,
          error: 'Your input is too long. Please shorten it and try again.',
          retryable: false,
        };
      }

      // OpenAI service issues — retry is reasonable
      if (error.status >= 500) {
        return {
          success: false,
          error: 'The AI service is temporarily unavailable. Please try again in a minute.',
          retryable: true,
        };
      }
    }

    // Unknown error — log it, don't expose internals
    console.error('Unexpected AI error:', error);
    return {
      success: false,
      error: 'Something went wrong. We have been notified.',
      retryable: false,
    };
  }
}

The `retryable` flag lets your UI decide whether to show a retry button. Small thing, but it makes the experience feel considered instead of broken.

Ship Small, Learn Fast

The pattern we keep coming back to: build the dumbest version that could possibly work, put it in front of real users, see where it fails. AI features especially need this because what users actually ask is never what you expected during development. We've had features that worked perfectly on our test cases completely fall apart on real user inputs — not because the code was wrong, but because humans are wonderfully unpredictable.

Our current stack for shipping AI features quickly: Next.js API routes for the backend, Vercel AI SDK for streaming, OpenAI for models, pgvector when we need retrieval, and Upstash Redis for rate limiting and caching. If you want a solid base that has all of this wired up already, our templates on peal.dev include auth, payments, and database setup so you can skip straight to the AI-specific work.

The most important thing we can tell you: stop waiting until you understand everything. The API docs are good, the SDKs are excellent, and you will learn 10x faster by shipping something real than by reading another article (including this one). Pick the smallest AI feature that would be genuinely useful in your app, build it this weekend, and fix what breaks next week.

You don't need to understand how embeddings work mathematically to use them correctly. You need to understand what problem they solve and what they cost. The rest is documentation.