# Automating Code Reviews with AI: Building a PR Reviewer Agent
Code reviews are one of the most valuable engineering practices — and one of the most time-consuming. Senior engineers spend 20-30% of their time reviewing PRs, and the feedback loop can take days. What if AI could handle the first pass?
I built an Automated PR Reviewer Agent that integrates directly into GitHub CI/CD pipelines. It analyzes code changes, detects issues, and posts actionable review comments — all before a human reviewer even looks at the PR.
## Why Automate Code Reviews?
| Problem | Impact |
|---|---|
| PR review bottlenecks | Slows down delivery velocity |
| Inconsistent review quality | Standards vary by reviewer |
| Missed security issues | Humans miss subtle vulnerabilities |
| Style/formatting debates | Wastes engineering time |
| Knowledge silos | Only 1-2 people know certain code areas |
An AI reviewer doesn't replace human reviewers — it augments them by handling the repetitive checks so humans can focus on architecture and logic.
## Architecture Overview

```
GitHub PR Event ──▶ GitHub Actions Workflow
                             │
                             ▼
                      FastAPI Service
                             │
                       ┌─────┴─────┐
                       │  CrewAI   │
                       │  Agents   │
                       ├───────────┤
                       │ Security  │ ──▶ Check for vulnerabilities
                       │ Quality   │ ──▶ Code quality analysis
                       │ Style     │ ──▶ Style & convention checks
                       └─────┬─────┘
                             │
                             ▼
                    GitHub PR Comments
                 (Inline review feedback)
```
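Before any agent runs, the FastAPI service has to pull the PR coordinates out of the incoming webhook payload. A minimal sketch of that extraction step, assuming GitHub's standard `pull_request` event schema (the helper name `extract_pr_info` is mine):

```python
def extract_pr_info(payload: dict) -> dict:
    """Pull the fields the reviewer needs from a GitHub pull_request event."""
    pr = payload["pull_request"]
    return {
        "repo": payload["repository"]["full_name"],
        "pr_number": pr["number"],
        "head_sha": pr["head"]["sha"],
        "diff_url": pr["diff_url"],
    }

# Example payload (heavily trimmed, field names per GitHub's event schema):
event = {
    "repository": {"full_name": "acme/widgets"},
    "pull_request": {
        "number": 42,
        "head": {"sha": "abc123"},
        "diff_url": "https://github.com/acme/widgets/pull/42.diff",
    },
}
print(extract_pr_info(event)["pr_number"])  # → 42
```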
## The Multi-Agent Approach with CrewAI
Instead of one monolithic reviewer, I use specialized agents for different review aspects:
```python
from crewai import Agent, Task, Crew

# `llm` is assumed to be configured elsewhere — one model instance
# shared by all three agents.

# Agent 1: Security Reviewer
security_reviewer = Agent(
    role="Security Code Reviewer",
    goal="Identify security vulnerabilities in Python code changes",
    backstory="""You are a senior security engineer specialized in Python
    security. You look for SQL injection, XSS, insecure deserialization,
    hardcoded secrets, and OWASP Top 10 vulnerabilities.""",
    llm=llm,
    verbose=True,
)

# Agent 2: Code Quality Reviewer
quality_reviewer = Agent(
    role="Code Quality Reviewer",
    goal="Identify code quality issues like complexity, duplication, and poor naming",
    backstory="""You are a senior Python developer who values clean code.
    You look for cyclomatic complexity, DRY violations, poor variable naming,
    missing error handling, and SOLID principle violations.""",
    llm=llm,
    verbose=True,
)

# Agent 3: Style & Convention Reviewer
style_reviewer = Agent(
    role="Python Style Reviewer",
    goal="Ensure code follows PEP 8, type hints, and project conventions",
    backstory="""You enforce Python coding standards. You check for PEP 8
    compliance, proper type hints, docstring conventions, and import ordering.""",
    llm=llm,
    verbose=True,
)
```
## Defining Review Tasks

````python
def create_review_tasks(diff: str, file_path: str):
    security_task = Task(
        description=f"""
        Review this code diff for security vulnerabilities:

        File: {file_path}
        Diff:
        ```
        {diff}
        ```

        For each issue found, provide:
        1. Line number
        2. Severity (critical/high/medium/low)
        3. Description of the vulnerability
        4. Suggested fix with code example
        """,
        agent=security_reviewer,
        expected_output="List of security findings with line numbers and fixes",
    )

    quality_task = Task(
        description=f"""
        Review this code diff for quality issues:

        File: {file_path}
        Diff:
        ```
        {diff}
        ```

        Focus on: complexity, error handling, naming, duplication.
        """,
        agent=quality_reviewer,
        expected_output="List of quality findings with suggestions",
    )

    # Without a task of its own, the style agent would never run.
    style_task = Task(
        description=f"""
        Review this code diff for PEP 8, type hint, and docstring issues:

        File: {file_path}
        Diff:
        ```
        {diff}
        ```
        """,
        agent=style_reviewer,
        expected_output="List of style findings with suggestions",
    )

    return [security_task, quality_task, style_task]
````
## Running the Crew

```python
from crewai import Crew, Process

crew = Crew(
    agents=[security_reviewer, quality_reviewer, style_reviewer],
    tasks=create_review_tasks(diff, file_path),
    verbose=True,
    process=Process.sequential,  # tasks run in the order listed: security first
)
result = crew.kickoff()
```
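The crew's raw output is free text, so before anything touches GitHub I normalize it into finding dicts. One hedged approach, assuming the task prompts instruct each agent to emit a JSON array (this schema is my own, shaped to match what the comment-posting code expects):

```python
import json

def parse_findings(raw_output: str) -> list[dict]:
    """Extract a JSON array of findings from an agent's text output."""
    # Agents sometimes wrap JSON in prose or code fences; grab the
    # outermost [...] span and parse it, dropping malformed entries.
    start, end = raw_output.find("["), raw_output.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        findings = json.loads(raw_output[start : end + 1])
    except json.JSONDecodeError:
        return []
    required = {"file", "line", "severity", "description"}
    return [f for f in findings if isinstance(f, dict) and required <= f.keys()]

raw = ('Here are my findings:\n'
       '[{"file": "app.py", "line": 12, "severity": "high", '
       '"description": "SQL injection via f-string"}]')
print(len(parse_findings(raw)))  # → 1
```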
## GitHub Actions Integration

The magic happens in the CI/CD pipeline:

```yaml
# .github/workflows/ai-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR Diff
        id: diff
        run: |
          git diff origin/$GITHUB_BASE_REF...HEAD > pr_diff.txt

      - name: Run AI Review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: |
          python review_agent.py \
            --diff pr_diff.txt \
            --repo $GITHUB_REPOSITORY \
            --pr $PR_NUMBER
```
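On the other side, `review_agent.py` just needs to accept those three flags. A minimal argument-parsing sketch (the flag names mirror the workflow; everything else is an assumption about the script's internals):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="AI PR reviewer")
    parser.add_argument("--diff", required=True, help="Path to the diff file")
    parser.add_argument("--repo", required=True, help="owner/name of the repository")
    parser.add_argument("--pr", required=True, type=int, help="Pull request number")
    return parser

args = build_parser().parse_args(
    ["--diff", "pr_diff.txt", "--repo", "acme/widgets", "--pr", "42"]
)
print(args.pr)  # → 42
```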
## Posting Review Comments via GitHub API

```python
import httpx

async def post_review_comment(
    repo: str,
    pr_number: int,
    findings: list[dict],
    github_token: str,
):
    """Post inline review comments on the PR."""
    headers = {
        "Authorization": f"Bearer {github_token}",
        "Accept": "application/vnd.github.v3+json",
    }

    comments = []
    for finding in findings:
        comments.append({
            "path": finding["file"],
            "line": finding["line"],
            "side": "RIGHT",  # comment on the new version of the file
            "body": format_review_comment(finding),
        })

    # Create a single review containing all comments
    review_payload = {
        "body": f"🤖 **AI Code Review** — Found {len(findings)} issues",
        "event": "COMMENT",
        "comments": comments,
    }

    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews",
            headers=headers,
            json=review_payload,
        )
        response.raise_for_status()
        return response.json()
```
````python
def format_review_comment(finding: dict) -> str:
    severity_emoji = {
        "critical": "🔴",
        "high": "🟠",
        "medium": "🟡",
        "low": "🔵",
    }
    emoji = severity_emoji.get(finding["severity"], "⚪")
    return f"""{emoji} **{finding['severity'].upper()}**: {finding['description']}

**Suggested Fix:**
```python
{finding['suggested_fix']}
```
"""
````
## Smart Filtering: Avoiding Noise
The biggest challenge with AI reviews is noise — flagging things that don't matter. Here's how I reduce false positives:
### 1. Only Review Changed Lines

```python
import re

def extract_changed_lines(diff: str) -> dict[str, list[int]]:
    """Parse a git diff to extract only added/modified line numbers."""
    changed: dict[str, list[int]] = {}
    current_file = None
    current_line = 0
    for line in diff.split("\n"):
        if line.startswith("+++ b/"):
            current_file = line[6:]
            changed[current_file] = []
        elif line.startswith("@@"):
            # Hunk header, e.g. "@@ -10,4 +12,6 @@": the number after "+"
            # is the starting line in the new file.
            match = re.search(r"\+(\d+)", line)
            if match:
                current_line = int(match.group(1))
        elif line.startswith("+") and not line.startswith("+++"):
            if current_file is not None:
                changed[current_file].append(current_line)
            current_line += 1
        elif not line.startswith("-"):
            # Context lines advance the new-file line counter too.
            current_line += 1
    return changed
```
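The hunk-header parsing above leans on the unified-diff format: a header like `@@ -10,4 +12,6 @@` means the hunk covers 4 lines starting at line 10 in the old file and 6 lines starting at line 12 in the new file. A standalone snippet for pulling out both ranges (counts default to 1 when omitted, per the format):

```python
import re

HUNK_RE = re.compile(r"@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk_header(header: str) -> tuple[int, int, int, int]:
    """Return (old_start, old_count, new_start, new_count)."""
    m = HUNK_RE.match(header)
    if not m:
        raise ValueError(f"not a hunk header: {header!r}")
    old_start, old_count, new_start, new_count = m.groups()
    return (int(old_start), int(old_count or 1), int(new_start), int(new_count or 1))

print(parse_hunk_header("@@ -10,4 +12,6 @@"))  # → (10, 4, 12, 6)
```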
### 2. Confidence Threshold

Only post comments when the AI is confident:

```python
MIN_CONFIDENCE = 0.7
findings = [f for f in raw_findings if f["confidence"] >= MIN_CONFIDENCE]
```
### 3. Deduplication

Don't flag the same pattern multiple times:

```python
seen_patterns = set()
unique_findings = []
for finding in findings:
    pattern_key = (finding["type"], finding["description"][:100])
    if pattern_key not in seen_patterns:
        seen_patterns.add(pattern_key)
        unique_findings.append(finding)
```
## Results After 3 Months
| Metric | Before | After |
|---|---|---|
| Avg. review turnaround | 18 hours | 22 minutes (AI) + 4 hours (human) |
| Security issues caught pre-merge | ~60% | ~94% |
| Lines of review comments per PR | 3-5 | 8-12 (AI) + 2-3 (human) |
| Developer satisfaction | "Reviews take forever" | "AI catches the boring stuff" |
## Lessons Learned

- **Start with security only**: Initially, I deployed only the security reviewer. Quality and style reviews were added after the team trusted the tool.
- **Make it opt-out, not opt-in**: If developers have to manually trigger the review, they won't. Make it run on every PR automatically.
- **Never auto-reject**: The AI posts comments but never blocks merging. Humans make the final call.
- **Iterate on prompts**: The review quality improved dramatically over 2-3 months of prompt refinement based on real PR feedback.
- **Cost control**: Use GPT-4 only for security reviews (high stakes). Use GPT-3.5 for style checks (lower stakes, cheaper).
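The cost-control point boils down to a per-review-type model map. A trivial sketch (model names are placeholders; swap in whatever your provider offers):

```python
# Hypothetical routing table: the high-stakes review gets the stronger model.
MODEL_BY_REVIEW = {
    "security": "gpt-4",
    "quality": "gpt-3.5-turbo",
    "style": "gpt-3.5-turbo",
}

def model_for(review_type: str) -> str:
    # Fall back to the cheap model for anything unrecognized.
    return MODEL_BY_REVIEW.get(review_type, "gpt-3.5-turbo")

print(model_for("security"))  # → gpt-4
```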
Check out the full project on GitHub.