AI Code Review: Using LLMs to Catch Bugs Before Humans Do

Last month, an LLM caught a race condition in our subscription webhook handler that Stefan and I had both reviewed and missed. The kind of bug that only shows up when two Stripe events arrive 50ms apart and you've got three users hitting upgrade at the same time on a Tuesday. We'd been staring at that code for 20 minutes. The model flagged it in seconds. That was the moment we stopped treating AI code review as a novelty and started treating it as infrastructure.

This post is about how we actually use LLMs to review code — not the theoretical "AI will replace developers" take, but the practical "here's the prompt, here's the output, here's where it still falls flat" breakdown. We'll cover what LLMs are genuinely good at catching, what they miss, and how to integrate this into a real PR workflow without losing your mind.

What LLMs Actually Catch (That Humans Often Miss)

Human reviewers are great at architecture decisions, domain context, and naming. We're bad at exhaustive state enumeration. When you're reading code for the fifth hour of a review session, your brain pattern-matches to "this looks right" instead of actually tracing every code path. LLMs don't get tired. They'll trace every branch with the same attention every time.

Here's what we've seen LLMs reliably flag:

Missing error handling in async functions, particularly try/catch gaps where a thrown error would be silently swallowed
Race conditions in concurrent operations — especially when you're not awaiting properly or sharing mutable state
Off-by-one errors in pagination, slicing, or index logic
SQL injection and XSS vectors in code that builds queries or HTML by string concatenation instead of using parameterized queries or output escaping
Auth logic mistakes — like checking the wrong field, comparing to undefined, or missing a role check on one branch of an if statement
Unhandled promise rejections in event handlers
Type coercion bugs — the classic `== null` when you meant `=== null`, or a number compared to a string

These are the bugs that slip through human review because they're syntactically valid and logically plausible at a glance. An LLM has no glance — it reads everything.
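
To make one of these categories concrete, here is a minimal sketch of an off-by-one pagination bug. The function names and the 1-based page convention are assumptions for illustration, not code taken from a real PR.

```typescript
// Hypothetical helpers for illustration; pages are assumed to be 1-based.

// Buggy: treats the 1-based page number as if it were 0-based,
// so page 1 silently skips the first `pageSize` items.
function getPageBuggy<T>(items: T[], page: number, pageSize: number): T[] {
  const start = page * pageSize;
  return items.slice(start, start + pageSize);
}

// Fixed: convert the 1-based page number to a 0-based offset first.
function getPage<T>(items: T[], page: number, pageSize: number): T[] {
  const start = (page - 1) * pageSize;
  return items.slice(start, start + pageSize);
}

const sample = [1, 2, 3, 4, 5];
console.log(getPageBuggy(sample, 1, 2)); // [ 3, 4 ]
console.log(getPage(sample, 1, 2)); // [ 1, 2 ]
```

Both versions type-check and return plausible-looking arrays, which is why this kind of bug tends to survive a quick human read.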

The Prompt That Actually Works

Generic "review this code" prompts get you generic output. You'll get "consider adding comments" and "this function is quite long" — stuff that wastes everyone's time. The key is specificity about what kind of review you want.

Here's the system prompt we settled on after a few weeks of iteration:

const REVIEW_SYSTEM_PROMPT = `You are a senior backend engineer reviewing a pull request.
Focus exclusively on bugs, security issues, and correctness problems.

Do NOT comment on:
- Code style or formatting
- Naming conventions (unless genuinely confusing)
- Performance micro-optimizations
- Things that are opinions, not bugs

DO flag:
- Logic errors and incorrect assumptions
- Missing error handling
- Race conditions or concurrency bugs
- Security vulnerabilities (injection, auth bypass, etc.)
- Unhandled edge cases (null/undefined, empty arrays, negative numbers)
- Incorrect use of async/await
- Type mismatches that pass TypeScript's compile-time checks but fail at runtime

For each issue found:
1. Quote the exact line(s)
2. Explain what goes wrong and under what condition
3. Suggest the fix (concrete code, not theory)

If you find no real bugs, say so. Don't pad the review.`;

The "if you find no real bugs, say so" line is important. Without it, the model will invent issues to seem useful. We learned that one after getting a three-paragraph review about a two-line utility function that was completely correct.
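
For reference, here is a rough sketch of how a prompt like this can be paired with a diff in a chat-completions style request (one system message plus one user message). The helper name `buildReviewRequest` and the shortened placeholder prompt are hypothetical; the model name and token limit simply mirror the values used elsewhere in this post.

```typescript
// Sketch only: the helper name and the shortened prompt are hypothetical.
type ChatMessage = { role: "system" | "user"; content: string };

interface ReviewRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
}

// Stand-in for the full REVIEW_SYSTEM_PROMPT shown above.
const SYSTEM_PROMPT_PLACEHOLDER =
  "You are a senior backend engineer reviewing a pull request. " +
  "If you find no real bugs, say so. Don't pad the review.";

// Builds the JSON body: the review instructions go in the system message,
// and the diff goes in the user message.
function buildReviewRequest(
  diff: string,
  systemPrompt: string = SYSTEM_PROMPT_PLACEHOLDER
): ReviewRequest {
  return {
    model: "gpt-4o",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: "Review this diff:\n\n" + diff },
    ],
    max_tokens: 2000,
  };
}

const reviewRequest = buildReviewRequest("+ const total = price * qty;");
console.log(JSON.stringify(reviewRequest, null, 2));
```

Keeping the instructions in the system message and the diff in the user message means the prompt can be iterated on without touching the code that collects the diff.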

Wiring It Into Your PR Workflow

You can run this manually, but the real value comes when it's automatic — every PR gets reviewed before a human even looks at it. Here's a GitHub Actions workflow that does exactly that using the OpenAI API:

# .github/workflows/ai-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD \
            -- '*.ts' '*.tsx' '*.js' \
            > diff.txt
          echo "size=$(wc -c < diff.txt)" >> "$GITHUB_OUTPUT"

      - name: Run AI review
        if: steps.diff.outputs.size != '0'
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
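        run: |
          # Read the diff from the file and pipe the JSON body to curl, so a
          # large diff cannot exceed the shell's argument-size limit.
          RESPONSE=$(jq -n \
            --rawfile diff diff.txt \
            '{
              model: "gpt-4o",
              messages: [
                {role: "system", content: "You are a senior backend engineer reviewing a pull request. Focus exclusively on bugs and security issues. Quote exact lines, explain conditions, provide concrete fixes. If no bugs: say so."},
                {role: "user", content: ("Review this diff:\n\n" + $diff)}
              ],
              max_tokens: 2000
            }' | curl -sS https://api.openai.com/v1/chat/completions \
            -H "Authorization: Bearer $OPENAI_API_KEY" \
            -H "Content-Type: application/json" \
            --data-binary @-)

          # Quote the variable so the JSON is passed through unmodified, and
          # fail instead of posting "null" if the API returned an error.
          COMMENT=$(printf '%s' "$RESPONSE" | jq -r '.choices[0].message.content // empty')
          if [ -z "$COMMENT" ]; then
            echo "No review text in API response" >&2
            exit 1
          fi

          # printf turns \n into real newlines; inside a double-quoted shell
          # string they would be posted as literal backslash-n characters.
          gh api "repos/$REPO/issues/$PR_NUMBER/comments" \
            -f body="$(printf '### AI Code Review\n\n%s' "$COMMENT")"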

Keep your diff to TypeScript/JavaScript files. Feeding CSS diffs to a code reviewer is just wasting tokens and money. Filter aggressively.

One gotcha: large PRs will hit token limits. We cap diffs at around 8,000 tokens and split anything larger into file-by-file reviews. It's not perfect, but a 4,000-line PR probably needs to be broken up anyway.
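
The capping and splitting step could look roughly like the sketch below. It is not our exact implementation: the characters-divided-by-four token estimate is a rough heuristic rather than a real tokenizer, and the function names are hypothetical.

```typescript
// Sketch under stated assumptions: token counts are approximated as
// characters / 4, which is a rough heuristic and not a real tokenizer.
const MAX_TOKENS_PER_REVIEW = 8000; // the approximate cap mentioned above

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split a unified diff into one chunk per file, using the "diff --git"
// header that git emits at the start of each file's section.
function splitDiffByFile(diff: string): string[] {
  return diff
    .split(/^(?=diff --git )/m)
    .filter((chunk) => chunk.trim().length > 0);
}

// Return the whole diff if it fits under the cap, otherwise per-file chunks.
function planReviewChunks(
  diff: string,
  maxTokens: number = MAX_TOKENS_PER_REVIEW
): string[] {
  if (estimateTokens(diff) <= maxTokens) {
    return [diff];
  }
  return splitDiffByFile(diff);
}
```

Each returned chunk would then be sent as its own review request. A single file that is larger than the cap would still need further splitting (for example by hunk), which this sketch does not attempt.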

A Real Example: The Webhook Bug It Found

Here's a simplified version of the webhook code that got flagged. See if you can spot it before reading the model's output:

// Before — the buggy version
export async function handleSubscriptionUpdated(
  event: Stripe.Event
) {
  const subscription = event.data.object as Stripe.Subscription;
  const customerId = subscription.customer as string;

  const user = await db.query.users.findFirst({
    where: eq(users.stripeCustomerId, customerId),
  });

  if (!user) return; // customer not in our DB, skip

  await db
    .update(users)
    .set({
      subscriptionStatus: subscription.status,
      subscriptionPeriodEnd: new Date(subscription.current_period_end * 1000),
    })
    .where(eq(users.id, user.id));

  // Send confirmation email if newly active
  if (subscription.status === 'active') {
    await sendSubscriptionConfirmationEmail(user.email);
  }
}

The model's output: "Line 22 — `sendSubscriptionConfirmationEmail` will fire on every `customer.subscription.updated` event where status is 'active', not just the initial activation. A subscription renewal, a payment retry that succeeds, or a plan change will all trigger this email. Under concurrent webhook delivery (Stripe retries), users could receive multiple confirmation emails for the same event. Suggested fix: check if the previous status (`event.data.previous_attributes?.status`) was not 'active' before sending."

Correct. We'd tested the happy path and missed this. The fix:

// After — the fixed version
export async function handleSubscriptionUpdated(
  event: Stripe.Event
) {
  const subscription = event.data.object as Stripe.Subscription;
  const customerId = subscription.customer as string;
  const previousAttributes = event.data.previous_attributes as
    | Partial<Stripe.Subscription>
    | undefined;

  const user = await db.query.users.findFirst({
    where: eq(users.stripeCustomerId, customerId),
  });

  if (!user) return;

  await db
    .update(users)
    .set({
      subscriptionStatus: subscription.status,
      subscriptionPeriodEnd: new Date(subscription.current_period_end * 1000),
    })
    .where(eq(users.id, user.id));

  // Only send email on transition to active, not on every active event
  const justBecameActive =
    subscription.status === 'active' &&
    previousAttributes?.status !== undefined &&
    previousAttributes.status !== 'active';

  if (justBecameActive) {
    await sendSubscriptionConfirmationEmail(user.email);
  }
}
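
The transition check stops repeat emails for later "active" updates, but on its own it does not cover the other case the model mentioned: the same event being delivered more than once. One common approach is to record processed event IDs and skip repeats. Below is a minimal in-memory sketch of that idea. `createEventDeduper` is a hypothetical helper, not something from our codebase, and a real deployment would more likely use a database table with a unique constraint on the event ID, since an in-memory set is lost on restart and is not shared across instances.

```typescript
// Hypothetical helper: remembers which event IDs have been handled.
// In-memory only, for illustration; state is lost on restart and is not
// shared between server instances.
function createEventDeduper(): (eventId: string) => boolean {
  const seen = new Set<string>();
  // Returns true the first time an ID is seen, false on any repeat.
  return (eventId: string): boolean => {
    if (seen.has(eventId)) {
      return false;
    }
    seen.add(eventId);
    return true;
  };
}

const isFirstDelivery = createEventDeduper();
console.log(isFirstDelivery("evt_123")); // true
console.log(isFirstDelivery("evt_123")); // false
```

In the handler above, this would sit at the top as an early return along the lines of `if (!isFirstDelivery(event.id)) return;`.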

Where LLMs Still Fail at Code Review

We're not pretending this is magic. There are categories of bugs that LLMs consistently miss, and you should know them so you don't get overconfident.

Business logic errors — the model doesn't know your domain. If your pricing logic is wrong according to your business rules, the LLM will think it's correct code.
Cross-file bugs — unless you feed the full context, it can't catch issues that span multiple files. A function that looks fine in isolation might be called wrong everywhere.
Performance problems at scale — it might not flag a database query that's fine at 100 rows but destroys production at 10 million.
Infrastructure and deployment issues, such as code that's syntactically correct but ships with misconfigured environment variables, wrong CORS headers, or missing rate limits at the infra level.
Test quality — it can tell you tests exist, but not whether they actually cover the right scenarios.
Framework-specific footguns — Next.js App Router caching behavior, React concurrent mode subtleties, stuff that requires knowing the runtime deeply.

The model also hallucinates sometimes. It'll flag a "bug" that isn't a bug. We've seen it invent type errors that TypeScript would immediately catch, and complain about missing null checks that are literally handled three lines above. Always verify a flagged issue against the actual code before jumping to fix it.
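
Because the prompt asks the model to quote exact lines, one cheap sanity check is to confirm that a quoted line really appears in the diff before anyone acts on the finding. The sketch below assumes the quoted snippets have already been extracted from the review text (that parsing step is not shown) and that each quote is a single line. `quoteAppearsInDiff` is a hypothetical helper, not part of the workflow described in this post.

```typescript
// Collapse runs of whitespace so indentation differences do not matter.
function collapseWhitespace(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

// Hypothetical helper: returns true if a single-line quote from the review
// can be found in some line of the diff.
function quoteAppearsInDiff(quote: string, diff: string): boolean {
  const wanted = collapseWhitespace(quote);
  if (wanted.length === 0) {
    return false;
  }
  return diff
    .split("\n")
    // Drop the leading "+", "-", or " " marker that unified diffs put on each line.
    .map((line) => collapseWhitespace(line.replace(/^[+\- ]/, "")))
    .some((line) => line.includes(wanted));
}
```

A quote that fails this check is not proof of a hallucinated finding, but it is a reasonable signal to read that finding with extra suspicion.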

Making It Part of the Culture, Not Just a Tool

The failure mode we see in teams that adopt AI review is treating it as a gate — the bot approves, therefore it's fine, merge it. That's backwards. The bot is a first pass, not a final verdict. It catches the stuff that's embarrassing to miss. Humans catch the stuff that matters for the product.

Our actual process: AI review runs automatically on every PR and posts a comment. Stefan and I both read that comment before we start our human review. If the bot found something real, we fix it first. If the bot found nothing, we still do a human review — but we start with more confidence that the basics are covered.

The other thing we do: when the bot misses a bug that we catch in human review, we add it to a running doc of "things the AI misses." Over time this informs how we structure our prompts and what we know to look for manually.

AI review is a force multiplier, not a replacement. It lets you bring your human attention to the hard problems instead of wasting it on the obvious ones.

A Lighter-Weight Option: Review in Your Editor

If you're not ready to set up a GitHub Actions workflow, you can get 80% of the value just by making it a habit to paste your diff into Claude or GPT before pushing. It takes 30 seconds. We did this manually for two months before automating it, and it still caught real issues.

In Cursor, you can also just select a function, hit Cmd+K, and ask "what could go wrong with this code" — the inline response is often surprisingly good for small, contained pieces of logic. Not as thorough as a full diff review, but it catches things mid-flow before you even write the test.

The peal.dev templates we ship all go through both layers — automated AI review in CI and human review before we publish. Not because we're paranoid, but because the templates need to work reliably for people who are building on them without knowing all the edge cases we ran into. It's the kind of thing you only care about after you've been burned by a template that looked fine but had a subtle auth bug.

The Practical Takeaway

Add AI code review to your workflow this week. Not "someday" — this week. Start with the manual version: before every PR, paste your diff into Claude with the system prompt from above. If you want automation, the GitHub Actions workflow above is copy-pasteable with just your secrets configured. You'll catch something real within a few PRs. When you do, it'll stop feeling like a novelty.

The goal isn't to replace code review — it's to make sure human review time goes toward the problems that actually require a human. Architecture decisions, business logic, "does this make sense for what we're building" — that's where your brain should be. Let the model handle the mechanical correctness check.