Overview
This project explores using LLMs to automate code review. The idea: what if an AI agent could do the first pass on PRs, catching obvious issues before human reviewers step in?
Goals:
- Catch bugs, security risks, and style violations automatically
- Provide instant feedback to developers
- Free up human reviewers for architecture and design discussions
- Make code review consistent across the team
Approach
The agent integrates with GitHub webhooks and uses Claude to analyze diffs:
- Analyze pull requests when opened or updated
- Identify potential issues (bugs, security risks, style problems)
- Leave inline comments with specific suggestions
- Support configurable severity levels (block, warn, info)
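To make the flow concrete, here is a minimal sketch of the webhook entry point, assuming a Flask app. The event and action names come from GitHub's pull_request webhook; run_review is a placeholder for the rest of the pipeline, and signature verification is omitted.

# Sketch of the GitHub webhook entry point (Flask assumed; signature check omitted).
from flask import Flask, request

app = Flask(__name__)

def run_review(repo: str, pr_number: int, head_sha: str) -> None:
    # Placeholder for the rest of the pipeline: diff analysis, LLM review, comments.
    print(f"Queued review for {repo}#{pr_number} at {head_sha}")

@app.route("/webhooks/github", methods=["POST"])
def handle_pull_request():
    if request.headers.get("X-GitHub-Event") != "pull_request":
        return "", 204  # ignore events we don't review
    payload = request.get_json()
    if payload.get("action") not in ("opened", "synchronize"):
        return "", 204  # only new PRs and new pushes
    pr = payload["pull_request"]
    run_review(payload["repository"]["full_name"], pr["number"], pr["head"]["sha"])
    return "", 202  # acknowledge quickly; the review itself should run asynchronously

Acknowledging immediately and doing the review asynchronously matters in practice, since GitHub webhook deliveries have a short timeout.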
How It Works
When a PR is opened or updated:
- Webhook trigger: The PR event wakes the review agent
- Diff analysis: Extract changed files and code context
- LLM review: Claude analyzes the diff with custom prompts for each file type
- Comment posting: Agent leaves inline feedback on specific lines
- Summary report: High-level overview of findings with severity breakdown
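The LLM review step above reduces to a single API call per diff chunk. This sketch assumes the anthropic Python SDK and a JSON output contract chosen for illustration; the model name is likewise illustrative rather than the project's actual setting.

# Sketch of the LLM review call (anthropic SDK assumed; model name and the
# JSON findings format are illustrative assumptions).
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def review_diff(diff_text: str, system_prompt: str) -> list[dict]:
    # Ask Claude to review one diff and return structured findings.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model choice
        max_tokens=2048,
        system=system_prompt,  # e.g. a language-specific review prompt
        messages=[{
            "role": "user",
            "content": (
                "Review the following diff. Respond with a JSON array of findings, "
                "each with: path, line, category, severity, message.\n\n" + diff_text
            ),
        }],
    )
    # The prompt asks for JSON, so parse the text block of the reply.
    return json.loads(response.content[0].text)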
Review Categories
The agent checks for:
- Bugs: Null pointer risks, off-by-one errors, resource leaks
- Security: SQL injection, XSS vulnerabilities, hardcoded secrets
- Performance: Inefficient algorithms, unnecessary DB calls, memory leaks
- Style: Naming conventions, code organization, documentation gaps
- Best practices: Error handling, logging, test coverage
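Internally it helps to give every finding a fixed shape. The sketch below mirrors the categories and severity levels above; the default category-to-severity mapping is an illustrative assumption, not the team's actual defaults.

# Sketch of a finding record and a category -> default severity mapping
# (the mapping values are illustrative, not the team's real defaults).
from dataclasses import dataclass
from enum import Enum

class Category(str, Enum):
    BUG = "bug"
    SECURITY = "security"
    PERFORMANCE = "performance"
    STYLE = "style"
    BEST_PRACTICE = "best_practice"

class Severity(str, Enum):
    BLOCK = "block"
    WARN = "warn"
    INFO = "info"

DEFAULT_SEVERITY = {
    Category.SECURITY: Severity.BLOCK,
    Category.BUG: Severity.WARN,
    Category.PERFORMANCE: Severity.WARN,
    Category.BEST_PRACTICE: Severity.INFO,
    Category.STYLE: Severity.INFO,
}

@dataclass
class Finding:
    path: str
    line: int
    category: Category
    severity: Severity
    message: str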
Architecture
- GitHub Integration: Webhook listener for PR events
- Diff Parser: Extracts context-aware code snippets for review
- LLM Reviewer: Claude with role-specific prompts (Python, Go, TypeScript, etc.)
- Comment Engine: Posts feedback as GitHub review comments with line numbers
- Config Layer: Team-specific rulesets (severity thresholds, ignored patterns)
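The comment engine is the only component that writes back to GitHub. Here is a rough sketch, assuming the REST endpoint for pull request review comments and a token in the GITHUB_TOKEN environment variable; retries and pagination are left out.

# Sketch of posting one inline review comment via the GitHub REST API
# (POST /repos/{owner}/{repo}/pulls/{number}/comments); auth handling is simplified.
import os
import requests

GITHUB_API = "https://api.github.com"

def post_inline_comment(repo: str, pr_number: int, commit_sha: str,
                        path: str, line: int, body: str) -> None:
    # Attach a comment to a specific line of the PR diff.
    resp = requests.post(
        f"{GITHUB_API}/repos/{repo}/pulls/{pr_number}/comments",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "body": body,
            "commit_id": commit_sha,  # the head commit the comment applies to
            "path": path,
            "line": line,  # line number in the new version of the file
            "side": "RIGHT",
        },
        timeout=30,
    )
    resp.raise_for_status()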
What I Learned
1. Context is everything. Early versions reviewed code line-by-line without understanding the broader function. We added "context windows" that include surrounding code and docstrings; this dramatically improved suggestion quality (see the sketch after this list).
2. Tune for false positives. Initial aggressive settings flooded PRs with comments. We added severity tuning: block only on critical issues, warn on medium ones, and post info-level notes for minor suggestions.
3. Reviewers still matter. The agent catches mechanical issues, but humans handle architecture, design patterns, and team dynamics. We positioned it as a "first pass" review, not a replacement.
4. Incremental diff review works better. Reviewing entire PRs at once was overwhelming. We switched to incremental review: analyze only the latest changes in each push. This kept feedback focused and actionable.
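To illustrate point 1, here is a sketch of the context-window idea: pad each changed region with surrounding source lines before it goes to the model. The 20-line padding is an arbitrary choice, not a tuned value.

# Sketch of building a "context window" around a changed region: the model sees
# the changed lines plus surrounding lines instead of a bare diff hunk.
def context_window(file_lines: list[str], start: int, end: int, padding: int = 20) -> str:
    # Return lines [start, end] (1-based) padded by `padding` lines on each side.
    lo = max(0, start - 1 - padding)
    hi = min(len(file_lines), end + padding)
    snippet = []
    for i in range(lo, hi):
        marker = ">" if start <= i + 1 <= end else " "  # mark the changed lines
        snippet.append(f"{marker} {i + 1:5d} | {file_lines[i]}")
    return "\n".join(snippet)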
Technical Details
Prompt Engineering
We created specialized prompts for each language and review type:
Python Security Review:
"You are a security-focused code reviewer for Python. Analyze this diff for:
- SQL injection risks (raw query construction)
- Command injection (subprocess, os.system)
- Path traversal vulnerabilities
- Hardcoded credentials or API keys
...
Provide specific line numbers and concrete mitigation steps."
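The prompts live in a registry keyed by language and review type, with a generic fallback. The sketch below uses abbreviated placeholder prompt text rather than the full production prompts.

# Sketch of a prompt registry keyed by (language, review type); prompt bodies
# are abbreviated placeholders, not the full production prompts.
PROMPTS = {
    ("python", "security"): (
        "You are a security-focused code reviewer for Python. Analyze this diff for "
        "SQL injection, command injection, path traversal, and hardcoded credentials. "
        "Provide specific line numbers and concrete mitigation steps."
    ),
    ("typescript", "style"): (
        "You are a style reviewer for TypeScript. Flag naming, organization, and "
        "documentation issues in this diff, with line numbers."
    ),
}

GENERIC_PROMPT = (
    "You are a careful code reviewer. Analyze this diff and report issues with line numbers."
)

def select_prompt(language: str, review_type: str) -> str:
    # Fall back to a generic reviewer prompt when no specialized one exists.
    return PROMPTS.get((language.lower(), review_type.lower()), GENERIC_PROMPT)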
Configurable Rulesets
Teams defined custom rules in .ai-review.yaml:
rules:
  security:
    severity: block
    patterns:
      - hardcoded_credentials
      - sql_injection
  style:
    severity: info
    max_function_length: 50
ignore:
  - "vendor/*"
  - "*.test.ts"
Rate Limiting & Cost Control
- Only review PRs with < 1,000 lines changed (larger PRs get summary comments)
- Cache review results to avoid re-reviewing unchanged code
- Use cheaper models for style checks, advanced models for security/bugs
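A sketch of those three controls together; the threshold, the in-memory cache, and the model names are illustrative assumptions rather than production values.

# Sketch of the cost controls: size gate, per-diff result cache, and model routing.
# Threshold, cache, and model names are illustrative assumptions.
import hashlib

MAX_CHANGED_LINES = 1000
_review_cache: dict[str, list[dict]] = {}  # diff hash -> cached findings

def should_full_review(changed_lines: int) -> bool:
    # Large PRs get a summary comment instead of a line-by-line review.
    return changed_lines < MAX_CHANGED_LINES

def cached_review(file_diff: str, run_review) -> list[dict]:
    # Skip the LLM call when this exact diff has already been reviewed.
    key = hashlib.sha256(file_diff.encode()).hexdigest()
    if key not in _review_cache:
        _review_cache[key] = run_review(file_diff)
    return _review_cache[key]

def pick_model(review_type: str) -> str:
    # Cheaper model for style checks, stronger model for security and bug hunting.
    return "claude-3-5-haiku-latest" if review_type == "style" else "claude-sonnet-4-20250514"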
Outcome
The agent became a core part of the development workflow. Developers rely on it for instant feedback, and reviewers use it to focus on higher-level concerns. The team expanded it to support:
- Pre-commit hooks: Local review before pushing
- CI integration: Block merges on critical findings (see the sketch after this list)
- Metrics dashboard: Track code quality trends over time
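The CI gate can be a short script that fails the job on any block-severity finding. This sketch assumes the review step has exported its findings to a findings.json file, which is an assumed format rather than the project's actual one.

# Sketch of a CI gate: fail the pipeline when any finding has severity "block".
# Assumes the review step wrote its findings to findings.json (an assumption).
import json
import sys

def main(path: str = "findings.json") -> int:
    with open(path) as f:
        findings = json.load(f)
    blocking = [item for item in findings if item.get("severity") == "block"]
    for finding in blocking:
        print(f"BLOCK {finding['path']}:{finding['line']} {finding['message']}")
    return 1 if blocking else 0  # a nonzero exit code blocks the merge

if __name__ == "__main__":
    sys.exit(main())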
Next evolution: Training custom models on team-specific codebases and historical reviews for even more context-aware feedback.