Automating Code Reviews with AI: Building a PR Reviewer Agent

2025-03-25 · AI, Code Review, GitHub Actions, CrewAI, Python

Code reviews are one of the most valuable engineering practices — and one of the most time-consuming. Senior engineers spend 20-30% of their time reviewing PRs, and the feedback loop can take days. What if AI could handle the first pass?

I built an Automated PR Reviewer Agent that integrates directly into GitHub CI/CD pipelines. It analyzes code changes, detects issues, and posts actionable review comments — all before a human reviewer even looks at the PR.

Why Automate Code Reviews?

| Problem | Impact |
| --- | --- |
| PR review bottlenecks | Slows down delivery velocity |
| Inconsistent review quality | Standards vary by reviewer |
| Missed security issues | Humans miss subtle vulnerabilities |
| Style/formatting debates | Wastes engineering time |
| Knowledge silos | Only 1-2 people know certain code areas |

An AI reviewer doesn't replace human reviewers — it augments them by handling the repetitive checks so humans can focus on architecture and logic.

Architecture Overview

GitHub PR Event ──▶ GitHub Actions Workflow
                          │
                          ▼
                    FastAPI Service
                          │
                    ┌─────┴─────┐
                    │  CrewAI    │
                    │  Agents    │
                    ├───────────┤
                    │ Security  │ ──▶ Check for vulnerabilities
                    │ Quality   │ ──▶ Code quality analysis
                    │ Style     │ ──▶ Style & convention checks
                    └─────┬─────┘
                          │
                          ▼
                    GitHub PR Comments
                    (Inline review feedback)

The Multi-Agent Approach with CrewAI

Instead of one monolithic reviewer, I use specialized agents for different review aspects:

from crewai import Agent, Task, Crew

# `llm` is assumed to be a configured model instance created elsewhere
# (e.g., a LangChain chat model); each agent below receives it.

# Agent 1: Security Reviewer
security_reviewer = Agent(
    role="Security Code Reviewer",
    goal="Identify security vulnerabilities in Python code changes",
    backstory="""You are a senior security engineer specialized in Python
    security. You look for SQL injection, XSS, insecure deserialization,
    hardcoded secrets, and OWASP Top 10 vulnerabilities.""",
    llm=llm,
    verbose=True
)

# Agent 2: Code Quality Reviewer
quality_reviewer = Agent(
    role="Code Quality Reviewer",
    goal="Identify code quality issues like complexity, duplication, and poor naming",
    backstory="""You are a senior Python developer who values clean code.
    You look for cyclomatic complexity, DRY violations, poor variable naming,
    missing error handling, and SOLID principle violations.""",
    llm=llm,
    verbose=True
)

# Agent 3: Style & Convention Reviewer
style_reviewer = Agent(
    role="Python Style Reviewer",
    goal="Ensure code follows PEP 8, type hints, and project conventions",
    backstory="""You enforce Python coding standards. You check for PEP 8
    compliance, proper type hints, docstring conventions, and import ordering.""",
    llm=llm,
    verbose=True
)

Defining Review Tasks

def create_review_tasks(diff: str, file_path: str):
    security_task = Task(
        description=f"""
        Review this code diff for security vulnerabilities:

        File: {file_path}
        Diff:
        ```
        {diff}
        ```

        For each issue found, provide:
        1. Line number
        2. Severity (critical/high/medium/low)
        3. Description of the vulnerability
        4. Suggested fix with code example
        """,
        agent=security_reviewer,
        expected_output="List of security findings with line numbers and fixes"
    )

    quality_task = Task(
        description=f"""
        Review this code diff for quality issues:

        File: {file_path}
        Diff:
        ```
        {diff}
        ```

        Focus on: complexity, error handling, naming, duplication.
        """,
        agent=quality_reviewer,
        expected_output="List of quality findings with suggestions"
    )

    # Without its own task, the style reviewer would sit idle in the crew
    style_task = Task(
        description=f"""
        Review this code diff for style and convention issues:

        File: {file_path}
        Diff:
        ```
        {diff}
        ```

        Focus on: PEP 8 compliance, type hints, docstrings, import ordering.
        """,
        agent=style_reviewer,
        expected_output="List of style findings with suggestions"
    )

    return [security_task, quality_task, style_task]

Running the Crew

from crewai import Process

crew = Crew(
    agents=[security_reviewer, quality_reviewer, style_reviewer],
    tasks=create_review_tasks(diff, file_path),
    verbose=True,
    process=Process.sequential  # run the reviewers one after another
)

result = crew.kickoff()
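`kickoff()` returns free-form text, so the findings need to be parsed into a structure before they can be posted to GitHub. A minimal sketch, assuming the agents are prompted to wrap their findings in a JSON array (`parse_findings` is my name, not part of CrewAI):

```python
import json
import re

def parse_findings(raw_output: str) -> list[dict]:
    """Extract the first JSON array from an LLM's free-form answer.

    Assumes the agents were prompted to emit their findings as a JSON
    array; returns [] when no parseable array is present.
    """
    match = re.search(r"\[.*\]", raw_output, re.DOTALL)
    if not match:
        return []
    try:
        findings = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries so later steps can rely on the keys
    return [
        f for f in findings
        if isinstance(f, dict) and "line" in f and "severity" in f
    ]
```

Defensive parsing matters here: one malformed agent response should produce an empty review, not a crashed CI job.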

GitHub Actions Integration

The magic happens in the CI/CD pipeline:

# .github/workflows/ai-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR Diff
        id: diff
        run: |
          git diff origin/$GITHUB_BASE_REF...HEAD > pr_diff.txt

      - name: Run AI Review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: |
          python review_agent.py \
            --diff pr_diff.txt \
            --repo "$GITHUB_REPOSITORY" \
            --pr "$PR_NUMBER"
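On the Python side, `review_agent.py` has to accept those flags. A minimal entry-point sketch (flag names taken from the workflow above; everything else is assumed):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI matching the flags the workflow passes to review_agent.py."""
    parser = argparse.ArgumentParser(description="AI PR reviewer")
    parser.add_argument("--diff", required=True, help="Path to the PR diff file")
    parser.add_argument("--repo", required=True, help="owner/name of the repository")
    parser.add_argument("--pr", required=True, type=int, help="Pull request number")
    return parser
```

Keeping the parser in its own function makes the argument handling testable without invoking the whole review pipeline.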

Posting Review Comments via GitHub API

import httpx

async def post_review_comment(
    repo: str,
    pr_number: int,
    findings: list[dict],
    github_token: str
):
    """Post inline review comments on the PR."""
    headers = {
        "Authorization": f"Bearer {github_token}",
        "Accept": "application/vnd.github.v3+json"
    }

    comments = []
    for finding in findings:
        comments.append({
            "path": finding["file"],
            "line": finding["line"],
            "body": format_review_comment(finding)
        })

    # Create a review with all comments
    review_payload = {
        "body": f"🤖 **AI Code Review** — Found {len(findings)} issues",
        "event": "COMMENT",
        "comments": comments
    }

    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews",
            headers=headers,
            json=review_payload
        )
        return response.json()

def format_review_comment(finding: dict) -> str:
    severity_emoji = {
        "critical": "🔴",
        "high": "🟠",
        "medium": "🟡",
        "low": "🔵"
    }
    emoji = severity_emoji.get(finding["severity"], "⚪")

    return f"""{emoji} **{finding['severity'].upper()}**: {finding['description']}

**Suggested Fix:**
```python
{finding['suggested_fix']}
```
"""

Smart Filtering: Avoiding Noise

The biggest challenge with AI reviews is noise — flagging things that don't matter. Here's how I reduce false positives:

1. Only Review Changed Lines

import re

def extract_changed_lines(diff: str) -> dict[str, list[int]]:
    """Parse a unified git diff and return added/modified line numbers per file."""
    changed: dict[str, list[int]] = {}
    current_file = None
    current_line = 0

    for line in diff.split("\n"):
        if line.startswith("+++ b/"):
            current_file = line[6:]
            changed[current_file] = []
        elif line.startswith("@@"):
            # Hunk header looks like "@@ -a,b +c,d @@"; c is the new-file start line
            match = re.search(r"\+(\d+)", line)
            if match:
                current_line = int(match.group(1))
        elif line.startswith("+") and not line.startswith("+++"):
            if current_file is not None:
                changed[current_file].append(current_line)
            current_line += 1
        elif not line.startswith("-"):
            # Context lines advance the new-file line counter too
            current_line += 1

    return changed
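With the changed-line map in hand, findings that land outside the diff can be discarded before posting. A sketch (`filter_to_diff` is my name; it assumes findings carry the `file` and `line` keys requested in the task prompts):

```python
def filter_to_diff(findings: list[dict], changed: dict[str, list[int]]) -> list[dict]:
    """Keep only findings whose file/line pair was actually touched by the PR."""
    # Precompute sets so membership checks are O(1) per finding
    changed_sets = {path: set(lines) for path, lines in changed.items()}
    return [
        f for f in findings
        if f.get("line") in changed_sets.get(f.get("file"), set())
    ]
```

This also protects the GitHub API call: review comments on lines outside the diff are rejected with a 422.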

2. Confidence Threshold

Only post comments when the AI is confident:

MIN_CONFIDENCE = 0.7

findings = [f for f in raw_findings if f["confidence"] >= MIN_CONFIDENCE]

3. Deduplication

Don't flag the same pattern multiple times:

seen_patterns = set()
unique_findings = []

for finding in findings:
    pattern_key = (finding["type"], finding["description"][:100])
    if pattern_key not in seen_patterns:
        seen_patterns.add(pattern_key)
        unique_findings.append(finding)

Results After 3 Months

| Metric | Before | After |
| --- | --- | --- |
| Avg. review turnaround | 18 hours | 22 minutes (AI) + 4 hours (human) |
| Security issues caught pre-merge | ~60% | ~94% |
| Lines of review comments per PR | 3-5 | 8-12 (AI) + 2-3 (human) |
| Developer satisfaction | "Reviews take forever" | "AI catches the boring stuff" |

Lessons Learned

  1. Start with security only: Initially, I deployed only the security reviewer. Quality and style reviews were added after the team trusted the tool.

  2. Make it opt-out, not opt-in: If developers have to manually trigger the review, they won't. Make it run on every PR automatically.

  3. Never auto-reject: The AI posts comments but never blocks merging. Humans make the final call.

  4. Iterate on prompts: The review quality improved dramatically over 2-3 months of prompt refinement based on real PR feedback.

  5. Cost control: Use GPT-4 only for security reviews (high stakes). Use GPT-3.5 for style checks (lower stakes, cheaper).
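Lesson 5 reduces to a small routing table. A sketch (the model identifiers are illustrative placeholders, not pinned recommendations):

```python
# Route each reviewer role to a model tier. High-stakes security reviews
# get the strongest model; everything else gets the cheap tier.
MODEL_BY_ROLE = {
    "security": "gpt-4",
    "quality": "gpt-3.5-turbo",
    "style": "gpt-3.5-turbo",
}

def pick_model(role: str) -> str:
    """Return the model for a reviewer role, defaulting to the cheap tier."""
    return MODEL_BY_ROLE.get(role, "gpt-3.5-turbo")
```

Centralizing the mapping makes it trivial to re-tier a reviewer later without touching agent definitions.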


Check out the full project on GitHub.