Case Study
September 10, 2024

Building an AI Code Review Agent

Experimenting with an automated PR review system using LLMs to catch bugs and enforce coding standards in the development workflow.

Tags
Developer Tools · AI Agents · Code Quality · Automation

Overview

This project explores using LLMs to automate code review. The idea: what if an AI agent could do the first pass on PRs, catching obvious issues before human reviewers step in?

Goals:

  • Catch bugs, security risks, and style violations automatically
  • Provide instant feedback to developers
  • Free up human reviewers for architecture and design discussions
  • Make code review consistent across the team

Approach

The agent integrates with GitHub webhooks and uses Claude to analyze diffs (a minimal webhook sketch follows the list):

  1. Analyze pull requests when opened or updated
  2. Identify potential issues (bugs, security risks, style problems)
  3. Leave inline comments with specific suggestions
  4. Apply configurable severity levels (block, warn, info)
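
A minimal sketch of that webhook entry point, assuming Flask; review_pull_request is a hypothetical helper (sketched in the next section), not the actual implementation:

# Hypothetical webhook listener; the route and helper names are illustrative.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_webhook():
    event = request.headers.get("X-GitHub-Event", "")
    payload = request.get_json()
    # React only to PRs being opened or updated with new commits.
    if event == "pull_request" and payload.get("action") in ("opened", "synchronize"):
        pr = payload["pull_request"]
        review_pull_request(
            repo=payload["repository"]["full_name"],
            pr_number=pr["number"],
            head_sha=pr["head"]["sha"],
        )
    return "", 204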

How It Works

When a PR is opened, the following steps run (a code sketch follows the list):

  1. Webhook triggers the review agent
  2. Diff analysis: Extract changed files and code context
  3. LLM review: Claude analyzes the diff with custom prompts for each file type
  4. Comment posting: Agent leaves inline feedback on specific lines
  5. Summary report: High-level overview of findings with severity breakdown
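
The same steps as a rough pipeline sketch; fetch_diff, run_llm_review, post_inline_comment, summarize, and post_summary are hypothetical placeholders for the GitHub and Claude calls:

def review_pull_request(repo: str, pr_number: int, head_sha: str) -> None:
    # Steps 1-2: fetch the diff and extract changed files with surrounding context.
    files = fetch_diff(repo, pr_number)
    # Step 3: ask the LLM to review each file with a prompt matched to its type.
    findings = []
    for f in files:
        findings.extend(run_llm_review(f))
    # Step 4: post inline comments on the specific lines flagged.
    for finding in findings:
        post_inline_comment(repo, pr_number, head_sha, finding)
    # Step 5: post a summary comment with counts per severity.
    post_summary(repo, pr_number, summarize(findings))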

Review Categories

The agent checks for:

  • Bugs: Null pointer risks, off-by-one errors, resource leaks
  • Security: SQL injection, XSS vulnerabilities, hardcoded secrets
  • Performance: Inefficient algorithms, unnecessary DB calls, memory leaks
  • Style: Naming conventions, code organization, documentation gaps
  • Best practices: Error handling, logging, test coverage
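
One way to represent findings across these categories, sketched as a small dataclass; the field names are assumptions rather than the actual schema:

from dataclasses import dataclass

@dataclass
class Finding:
    category: str   # "bug", "security", "performance", "style", "best_practice"
    severity: str   # "block", "warn", or "info"
    path: str       # file the comment applies to
    line: int       # line in the diff the comment targets
    message: str    # suggestion text posted as the inline comment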

Architecture

  • GitHub Integration: Webhook listener for PR events
  • Diff Parser: Extracts context-aware code snippets for review
  • LLM Reviewer: Claude with language-specific prompts (Python, Go, TypeScript, etc.)
  • Comment Engine: Posts feedback as GitHub review comments with line numbers
  • Config Layer: Team-specific rulesets (severity thresholds, ignored patterns)

What I Learned

1. Context is everything. Early versions reviewed code line-by-line without understanding the broader function. We added "context windows" that include surrounding code and docstrings; this dramatically improved suggestion quality.
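
A rough illustration of the context-window idea: send the model a band of surrounding lines rather than only the changed ones (the window size and function name are assumptions):

def build_context_window(file_lines: list[str], changed_start: int, changed_end: int, radius: int = 20) -> str:
    # Include up to `radius` lines above and below the changed hunk so the
    # model sees the enclosing function, docstring, and nearby logic.
    start = max(0, changed_start - radius)
    end = min(len(file_lines), changed_end + radius)
    return "\n".join(file_lines[start:end])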

2. Tune for false positives. Initial aggressive settings flooded PRs with comments. We added severity tuning: block only on critical issues, warn on medium-severity findings, and post info-level notes for minor suggestions.

3. Reviewers still matter. The agent catches mechanical issues, but humans handle architecture, design patterns, and team dynamics. We positioned it as "first pass review," not a replacement.

4. Incremental diff review works better. Reviewing entire PRs at once was overwhelming. We switched to incremental review: analyze only the latest changes in each push. This kept feedback focused and actionable.
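
Incremental review reduces to diffing against the last commit the agent already reviewed instead of the PR base, roughly like this (the SHA store and compare helper are assumptions):

def incremental_diff(repo: str, pr_number: int, head_sha: str) -> list:
    # Compare only what changed since the last reviewed commit,
    # falling back to the full PR diff on the first run.
    last_sha = load_last_reviewed_sha(repo, pr_number)     # hypothetical store
    if last_sha is None:
        files = fetch_diff(repo, pr_number)
    else:
        files = compare_commits(repo, last_sha, head_sha)  # hypothetical GitHub compare call
    save_last_reviewed_sha(repo, pr_number, head_sha)
    return files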

Technical Details

Prompt Engineering

We created specialized prompts for each language and review type:

Python Security Review:
"You are a security-focused code reviewer for Python. Analyze this diff for:
- SQL injection risks (raw query construction)
- Command injection (subprocess, os.system)
- Path traversal vulnerabilities
- Hardcoded credentials or API keys
...
Provide specific line numbers and concrete mitigation steps."
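
Roughly how a prompt gets matched to a file type and sent through the Anthropic SDK; the prompt table, model name, and claude_review function are illustrative, not the production code:

import os
import anthropic

# Hypothetical mapping from file extension to a specialized review prompt;
# the real prompt set covers each language and review type.
REVIEW_PROMPTS = {
    ".py": "You are a security-focused code reviewer for Python. ...",
    ".go": "You are a security-focused code reviewer for Go. ...",
    ".ts": "You are a security-focused code reviewer for TypeScript. ...",
}

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def claude_review(path: str, diff_text: str) -> str:
    ext = os.path.splitext(path)[1]
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # model choice is illustrative
        max_tokens=2048,
        system=REVIEW_PROMPTS.get(ext, "You are a careful code reviewer."),
        messages=[{"role": "user", "content": diff_text}],
    )
    # The raw text is then parsed into line-level findings before posting comments.
    return response.content[0].text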

Configurable Rulesets

Teams defined custom rules in .ai-review.yaml:

rules:
  security:
    severity: block
    patterns:
      - hardcoded_credentials
      - sql_injection
  style:
    severity: info
    max_function_length: 50
ignore:
  - "vendor/*"
  - "*.test.ts"

Rate Limiting & Cost Control

  • Only review PRs with < 1,000 lines changed (larger PRs get summary comments)
  • Cache review results to avoid re-reviewing unchanged code
  • Use cheaper models for style checks, advanced models for security/bugs
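
A sketch of those three controls working together; the size threshold mirrors the bullet above, while the cache, model names, and review_pass helper are assumptions:

import hashlib

MAX_CHANGED_LINES = 1000
_review_cache: dict[str, list] = {}  # diff-hunk hash -> cached findings

def pick_model(review_type: str) -> str:
    # Cheaper model for style passes, a stronger one for security and bug passes
    # (model names are illustrative).
    return "claude-3-haiku-20240307" if review_type == "style" else "claude-3-5-sonnet-20240620"

def review_with_cost_controls(files: list, total_changed_lines: int) -> list:
    # Oversized PRs get a single summary comment instead of inline review.
    if total_changed_lines > MAX_CHANGED_LINES:
        return [summary_only_review(files)]  # hypothetical helper
    findings = []
    for f in files:
        key = hashlib.sha256(f.diff_text.encode()).hexdigest()
        if key not in _review_cache:  # unchanged hunks are served from the cache
            file_findings = []
            for review_type in ("style", "security", "bugs"):
                file_findings.extend(review_pass(f, review_type, model=pick_model(review_type)))
            _review_cache[key] = file_findings
        findings.extend(_review_cache[key])
    return findings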

Outcome

The agent became a core part of the development workflow. Developers rely on it for instant feedback, and reviewers use it to focus on higher-level concerns. The team expanded it to support:

  • Pre-commit hooks: Local review before pushing
  • CI integration: Block merges on critical findings
  • Metrics dashboard: Track code quality trends over time
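
For the CI integration, one simple gate is a script that exits nonzero whenever a blocking finding exists, which fails the merge check; this sketch assumes findings shaped like the dataclass above:

import sys

def ci_gate(findings: list) -> None:
    # Fail the CI job, and therefore block the merge, on any "block"-severity finding.
    critical = [f for f in findings if f.severity == "block"]
    for f in critical:
        print(f"BLOCK {f.path}:{f.line} {f.message}")
    sys.exit(1 if critical else 0)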

Next evolution: Training custom models on team-specific codebases and historical reviews for even more context-aware feedback.
