Building AI-Assisted Code Review Tools with the GPT-4o API
Your team’s pull request queue is backed up, reviewers are fatigued, and subtle bugs are slipping through because humans can only context-switch so many times a day. A senior engineer who just spent four hours in architecture meetings brings diminished attention to the fifteenth PR of the week—and that’s exactly when the null pointer dereference or the unescaped SQL input makes it to production.
The problem is structural, not personal. Code generation velocity is accelerating faster than review throughput can scale. AI-assisted development tools are pushing more code into review queues while the number of qualified reviewers stays roughly flat. Hiring more senior engineers doesn’t solve the bottleneck; it just moves it.
LLMs offer a practical wedge into this problem. Not as a replacement for human judgment—you still need engineers making architectural calls, evaluating tradeoffs, and understanding business context—but as a tireless first-pass reviewer that catches the mechanical failures: missing error handling, inconsistent null checks, hardcoded secrets, known insecure patterns, style violations that distract from real review. An LLM doesn’t get review fatigue at PR number fifteen. It applies the same attention to every diff.
GPT-4o makes this particularly tractable right now. Its code comprehension handles multi-file context coherently, its structured output support means you get machine-parseable results rather than free-form prose, and its latency is low enough to fit inside a CI pipeline without blocking developer flow.
What follows is an end-to-end implementation guide: prompt design, output parsing, and GitHub PR integration—the full stack for shipping an AI code review tool your team will actually use.
Why AI Code Review Is Worth Building Now
The bottleneck in modern software development has shifted. AI-assisted coding tools—Copilot, Cursor, Claude—have dramatically accelerated how fast engineers produce code. Pull requests that once represented a full day’s work now arrive in hours. Review queues are not keeping pace.

This asymmetry is structural, not temporary. A team that adopts AI code generation without augmenting its review process ends up with a throughput problem that compounds over time: more PRs, more context-switching for reviewers, more cognitive load spread across an already-stretched team. The answer is not to slow down code generation. It is to make review faster at the mechanical layer so humans can focus where judgment actually matters.
Where LLMs Fit in the Review Stack
LLMs are exceptionally good at pattern recognition across large bodies of code. They have been trained on millions of repositories, which means they have internalized common antipatterns—N+1 query problems, unhandled error paths, hardcoded credentials, unsafe deserialization, missing input validation—and can surface them consistently without fatigue.
What they are not good at is evaluating whether a system design is right for your team’s operational constraints, whether a new abstraction will age well, or whether a refactor aligns with a migration strategy that exists only in Confluence and institutional memory. That distinction defines the architecture of the workflow: LLMs handle the mechanical layer, humans handle the contextual layer.
Why GPT-4o Specifically
GPT-4o brings two capabilities that make it practical for production code review tooling rather than experimentation. First, its context window and code comprehension handle realistic file sizes and cross-file diffs without degrading. Second, its structured output support—via JSON mode and function calling—means you get machine-readable review results you can route, filter, and act on programmatically. A comment flagged as `severity: critical` can block a merge. One flagged as `severity: suggestion` gets posted as a non-blocking annotation. That programmability is what separates a useful CI tool from a chatbot you paste code into.
The combination of review speed, pattern coverage, and structured output makes this worth building as infrastructure rather than a one-off script.
The next section covers how to structure the API request itself—what goes in the system prompt, how to pass diff context efficiently, and which model parameters to set before you write a single line of review logic.
Anatomy of a Code Review Request: Designing Your API Integration
Before writing a single prompt, you need a client configuration that holds up under production load. That means handling transient network failures gracefully, setting sensible timeouts, and making deliberate choices about which API surface to use.
Initializing the Client
The OpenAI Python SDK exposes an OpenAI class that accepts retry and timeout configuration at initialization time. Set these once and reuse the client across your application rather than instantiating it per request.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    max_retries=3,  # exponential backoff on 429 and 5xx
    timeout=60.0,   # seconds; code diffs can be verbose
)
```

Pull the API key from the environment rather than hardcoding it. In a CI/CD context you will inject `OPENAI_API_KEY` as a repository secret, so this pattern carries forward without modification.
💡 Pro Tip: Set `OPENAI_API_KEY` in your local shell via a `.envrc` file managed by `direnv`. This keeps credentials out of your shell history and mirrors the secret-injection behavior of GitHub Actions exactly.
Chat Completions vs. the Responses API
OpenAI currently offers two API surfaces for text generation: the Chat Completions endpoint (/v1/chat/completions) and the newer Responses API (/v1/responses). For a code review tool, the distinction matters.
Chat Completions is stateless. You send the full conversation context on every request and receive a single response. For CI-triggered reviews where each pull request is an independent unit of work, this is the right choice. There is no session state to manage, and the payload structure is well-understood.
The Responses API maintains server-side conversation state across turns using a previous_response_id parameter. This is valuable for interactive review sessions where a developer asks follow-up questions about a specific finding. If your roadmap includes a Slack bot or a web UI on top of the same review engine, the Responses API gives you that without re-sending the full diff on every turn.
For this guide, the CI pipeline integration uses Chat Completions. The interactive layer, if you build it, swaps in the Responses API with minimal changes.
Configuring Model, Temperature, and Token Limits
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], timeout=60.0, max_retries=3)

def request_review(system_prompt: str, diff: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=1024,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Review the following diff:\n\n{diff}"},
        ],
    )
    return response.choices[0].message.content
```

Three parameters warrant explicit reasoning:

- `model="gpt-4o"`: Use the full model, not `gpt-4o-mini`, for review tasks. Code analysis requires multi-step reasoning over context that spans function boundaries, and the capability gap is measurable in false-negative rate on real diffs.
- `temperature=0.2`: Low temperature produces consistent, structured output. At `0.0` the model can exhibit repetition artifacts on long inputs; `0.2` is the practical floor for coherent prose while keeping results deterministic enough to diff across runs.
- `max_tokens=1024`: Enough headroom for a thorough structured response without allowing runaway generation. You will adjust this upward once you enforce a JSON schema on the output, since structured formats are more token-efficient than free prose.
With the client initialized and the basic request shape established, the next challenge is what you actually say to the model—and how you frame the diff so GPT-4o focuses on the findings that matter to your team rather than producing generic commentary.
Prompt Engineering for Code Analysis
The difference between a useful AI code reviewer and an annoying one comes down entirely to prompt design. Free-form review output is noise—you can’t route it, prioritize it, or integrate it into a CI decision gate. This section covers how to structure your prompts so that GPT-4o returns machine-parseable, severity-tagged review comments every time.
Define the Reviewer Persona and Output Contract
Your system prompt does two jobs: it establishes the reviewer’s identity and expertise, and it locks down the output format before the model generates a single token. Combining both in the system message is more reliable than scattering constraints across the user turn.
```python
SYSTEM_PROMPT = """You are a senior software engineer conducting a pull request review.
Your focus areas are: correctness, security, performance, and maintainability.
You do not comment on style unless it creates ambiguity or a bug risk.

You MUST respond with a JSON object matching this schema exactly:
{
  "summary": "<one sentence characterizing the overall change>",
  "issues": [
    {
      "severity": "critical" | "warning" | "suggestion",
      "category": "security" | "correctness" | "performance" | "maintainability",
      "file": "<filename>",
      "line": <integer or null>,
      "comment": "<actionable description of the issue>"
    }
  ],
  "approved": <boolean>
}

Severity definitions:
- critical: must be resolved before merging; introduces a bug, security flaw, or data loss risk
- warning: should be addressed but does not block merge; degrades quality or future maintainability
- suggestion: take-it-or-leave-it improvement; no meaningful risk if ignored"""
```

Hardcoding the severity taxonomy inside the system prompt prevents the model from inventing gradients like “minor-critical” or “high-warning” that break downstream parsing. The `approved` boolean gives your CI gate a single field to check without parsing prose.
Diff Context vs. Full File Context
Sending the raw git diff is cheaper and faster, but it loses surrounding context that matters for correctness analysis. The right strategy depends on what you’re reviewing:
Use diff-only context when:
- The change is self-contained (a new utility function, a config update)
- You’re operating under tight token budgets in a high-volume pipeline
- The diff is under ~300 lines
Use full file context when:
- The change modifies shared state, class methods, or middleware
- You need the model to reason about call sites outside the diff
- The file is small enough that full inclusion stays under your context budget
The implementation below handles both modes and injects the appropriate context block into the user message:
````python
def build_user_message(diff: str, full_files: dict[str, str] | None = None) -> str:
    parts = []

    if full_files:
        for filename, content in full_files.items():
            parts.append(f"### Full file: {filename}\n```\n{content}\n```")
        parts.append("### Diff\n```diff\n" + diff + "\n```")
        parts.append("Review the diff in the context of the full files above.")
    else:
        parts.append("### Diff\n```diff\n" + diff + "\n```")
        parts.append("Review only the changed lines. Do not speculate about code outside the diff.")

    return "\n\n".join(parts)
````

The final instruction in each branch matters. Without it, the model tends to invent context when reviewing a diff that references symbols it cannot see.
Structured Outputs and Schema Enforcement
Use OpenAI’s response_format parameter with json_object to eliminate the most common failure mode—markdown-wrapped JSON that breaks json.loads:
```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def request_review(system_prompt: str, user_message: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0.2,
    )
    return json.loads(response.choices[0].message.content)
```

Temperature at `0.2` reduces run-to-run variation in severity classification without making the model mechanical. At `0.0`, GPT-4o sometimes becomes over-conservative and collapses most issues into `suggestion`.
Few-Shot Examples for Consistent Classification
Even with a precise taxonomy, severity classification drifts across different diff types. A SQL query without parameterization and a missing null check both trigger “correctness,” but one is clearly critical and the other is warning. Few-shot examples anchor the boundary:
````python
FEW_SHOT_EXAMPLES = [
    {
        "role": "user",
        "content": "### Diff\n```diff\n- query = f\"SELECT * FROM users WHERE id = {user_id}\"\n```",
    },
    {
        "role": "assistant",
        "content": json.dumps({
            "summary": "Introduces a SQL injection vulnerability via string interpolation.",
            "issues": [{
                "severity": "critical",
                "category": "security",
                "file": "db/queries.py",
                "line": 14,
                "comment": "Use parameterized queries. Replace with cursor.execute('SELECT * FROM users WHERE id = %s', (user_id,)).",
            }],
            "approved": False,
        }),
    },
]
````

Prepend these examples between the system message and the actual user message. Two to three examples covering the severity boundaries (critical vs warning, warning vs suggestion) are enough to stabilize classification across diverse diffs.
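Wiring the examples into the request is just list concatenation. A minimal sketch of that assembly (the `build_messages` helper name is an assumption, not part of the OpenAI SDK):

```python
def build_messages(
    system_prompt: str,
    few_shot: list[dict],
    user_message: str,
) -> list[dict]:
    # System message first, then the few-shot user/assistant pairs,
    # then the real diff as the final user turn.
    return [
        {"role": "system", "content": system_prompt},
        *few_shot,
        {"role": "user", "content": user_message},
    ]
```

The resulting list drops straight into the `messages` parameter of the request function above.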
💡 Pro Tip: Store your few-shot examples in a versioned file alongside your prompt templates. When the model’s classification behavior shifts after an API update, you can diagnose whether the prompt or the model changed by replaying the same examples against both versions.
With structured output flowing reliably out of the API, the next challenge is routing those results to the right place—annotating the PR, triggering blocking checks, or filing issues for deferred work. That’s exactly what the next section covers.
Parsing and Routing Review Output
Once GPT-4o returns a structured response, the real engineering work begins: deserializing that output into typed objects, applying severity-based routing logic, and formatting findings into the exact shape the GitHub API expects. Done well, this layer makes the difference between a demo that works once and a system you can trust in production.
Deserializing into Typed Dataclasses
Assuming your prompt instructs the model to return a JSON array of findings with fields like severity, line, file, message, and suggestion, the first step is mapping that raw output into Python dataclasses:
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(str, Enum):
    # Mirrors the taxonomy the system prompt enforces.
    CRITICAL = "critical"
    WARNING = "warning"
    SUGGESTION = "suggestion"

@dataclass
class ReviewFinding:
    file: str
    line: int
    severity: Severity
    message: str
    suggestion: Optional[str] = None
    rule_id: Optional[str] = None

def parse_findings(raw_json: list[dict]) -> list[ReviewFinding]:
    findings = []
    for item in raw_json:
        try:
            findings.append(ReviewFinding(
                file=item["file"],
                line=int(item["line"]),
                severity=Severity(item["severity"].lower()),
                message=item["message"],
                suggestion=item.get("suggestion"),
                rule_id=item.get("rule_id"),
            ))
        except (KeyError, TypeError, ValueError):
            continue  # drop malformed entries; log in production
    return findings
```

The try/except block is load-bearing, not defensive theater. LLMs occasionally emit structurally valid JSON with semantically invalid field values—an out-of-range severity string, a non-integer line number—and dropping those entries is safer than crashing the review pipeline on a single bad finding. In production, replace that `continue` with a structured log entry so you can monitor malformation rates over time and detect prompt regressions early.
Filtering by Severity to Control Blocking Behavior
Not every finding warrants blocking a merge. Route findings into two buckets: blocking (critical) and advisory (warning, info):
def route_findings( findings: list[ReviewFinding],) -> tuple[list[ReviewFinding], list[ReviewFinding]]: blocking = [f for f in findings if f.severity == Severity.CRITICAL] advisory = [f for f in findings if f.severity != Severity.CRITICAL] return blocking, advisoryThis separation drives your CI exit code: if blocking is non-empty, the step exits non-zero. Advisory findings still get posted as PR comments but never fail the build. Keep this routing logic centralized rather than scattering severity checks across callers—when your team decides to promote WARNING to blocking status, you want a single line to change.
Deduplicating Across Multiple Review Passes
Large diffs often require chunking the diff and running multiple API calls. Without deduplication, the same logic error surfaces as three identical comments on the same line. Use a composite key to deduplicate before posting:
```python
def deduplicate(findings: list[ReviewFinding]) -> list[ReviewFinding]:
    seen: set[tuple] = set()
    unique = []
    for f in findings:
        key = (f.file, f.line, f.rule_id or f.message[:80])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique
```

💡 Pro Tip: Truncate `message` in the fallback key rather than using the full string. Two semantically identical findings from different prompt passes often have minor wording differences that would incorrectly survive deduplication if you hash the full text. Prefer `rule_id` as the primary discriminator whenever your prompt reliably produces one—it is stable across paraphrasing.
Formatting for the GitHub Pull Request Review API
GitHub’s pull request review comments endpoint expects path, line, body, and commit_id. Transform your findings directly:
```python
def to_github_comment(finding: ReviewFinding, commit_sha: str) -> dict:
    body_parts = [f"**[{finding.severity.value.upper()}]** {finding.message}"]
    if finding.suggestion:
        body_parts.append(f"\n> **Suggestion:** {finding.suggestion}")
    return {
        "path": finding.file,
        "line": finding.line,
        "side": "RIGHT",
        "body": "\n".join(body_parts),
        "commit_id": commit_sha,
    }

def build_review_payload(
    findings: list[ReviewFinding],
    commit_sha: str,
    event: str = "COMMENT",
) -> dict:
    return {
        "commit_id": commit_sha,
        "event": event,  # "REQUEST_CHANGES" for blocking findings
        "comments": [to_github_comment(f, commit_sha) for f in findings],
    }
```

Pass `event="REQUEST_CHANGES"` when blocking findings are present; otherwise use `"COMMENT"` to post advisory notes without blocking the merge. One subtlety worth noting: GitHub enforces a limit on the number of comments per review request. If your deduplicated findings list is large, batch comments into multiple review submissions rather than attempting to post them all in a single API call, which fails with a 422 error and rejects the entire payload.
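A minimal batching helper along those lines; the 30-comment batch size here is a conservative assumption, not a documented GitHub limit:

```python
def batch_comments(comments: list[dict], batch_size: int = 30) -> list[list[dict]]:
    # Slice the comment list into review-sized batches; each batch becomes
    # one review submission against the reviews endpoint.
    return [comments[i:i + batch_size] for i in range(0, len(comments), batch_size)]
```

Submit each batch as its own review payload and the large-findings case degrades gracefully instead of failing wholesale.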
With findings parsed, routed, deduplicated, and formatted, you have a payload ready to POST to the GitHub API. The next section wires this entire pipeline into a GitHub Actions workflow so it runs automatically on every pull request.
Integrating into CI/CD with GitHub Actions
With your review pipeline built and output parsing in place, the next step is making it run automatically on every pull request. GitHub Actions gives you the trigger hooks, secret management, and GitHub API access needed to wire this together without any external infrastructure.
The Workflow File
Create .github/workflows/ai-review.yml in your repository:
```yaml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install openai requests

      - name: Run AI review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
        run: python scripts/ai_review.py
```

The `on.pull_request` trigger fires on three event types: `opened` (new PR), `synchronize` (new commits pushed to an existing PR), and `reopened` (a previously closed PR brought back). Together, these cover every meaningful state change where a review adds value.
The permissions block is required. Without pull-requests: write, the runner cannot post comments. contents: read is needed for the checkout step. GITHUB_TOKEN is provisioned automatically by Actions—you only need to add OPENAI_API_KEY to your repository secrets under Settings → Secrets and variables → Actions. Encrypted secrets are never exposed in logs or to forked PRs by default.
Fetching the Diff Programmatically
Inside scripts/ai_review.py, pull the unified diff directly from GitHub’s REST API:
```python
import os

import requests

def fetch_pr_diff(repo: str, pr_number: str, token: str) -> str:
    url = f"https://api.github.com/repos/{repo}/pulls/{pr_number}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github.v3.diff",
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text

repo = os.environ["REPO"]
pr_number = os.environ["PR_NUMBER"]
token = os.environ["GH_TOKEN"]

diff = fetch_pr_diff(repo, pr_number, token)
```

The `application/vnd.github.v3.diff` Accept header returns a raw unified diff rather than JSON, which is exactly what your prompt pipeline expects. Pass `diff` directly into the review function you built in the prompt engineering section. For very large PRs, consider truncating the diff or filtering to changed files that match specific path patterns before sending it to the model—this controls both token cost and response quality.
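One way to do that filtering: keep only the per-file hunks whose paths match an allow-list of glob patterns, parsing each path from its `diff --git` header. A sketch with hypothetical helper names:

```python
import fnmatch

def filter_diff(diff: str, include: list[str]) -> str:
    # Walk the unified diff line by line; toggle keep_current at each
    # "diff --git a/<path> b/<path>" header based on the allow-list.
    kept, keep_current = [], False
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            path = line.split(" b/", 1)[-1].strip()
            keep_current = any(fnmatch.fnmatch(path, pat) for pat in include)
        if keep_current:
            kept.append(line)
    return "".join(kept)
```

Calling `filter_diff(diff, include=["src/*.py", "*.yml"])` before building the prompt drops lock files and vendored assets without touching the review logic itself.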
Posting Inline Comments
Once you have structured review output, post findings as review comments anchored to specific lines:
```python
def post_review_comments(repo: str, pr_number: str, token: str, comments: list[dict]) -> None:
    url = f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    payload = {
        "event": "COMMENT",
        "comments": [
            {
                "path": c["file"],
                "line": c["line"],
                "body": f"**[AI Review]** {c['message']}",
                "side": "RIGHT",
            }
            for c in comments
        ],
    }
    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
```

Each comment object requires `path` (the relative file path from the repository root), `line` (the line number in the diff hunk), and `body`. The `side` field specifies which side of the diff to anchor to—`RIGHT` targets the new version of the file, which is almost always what you want for review feedback. Your structured output parser from the previous section produces exactly this shape, so the two pieces connect cleanly.
💡 Pro Tip: Batch all comments into a single `POST /reviews` request rather than making one request per comment. GitHub’s API processes them atomically, and you avoid rate-limiting issues on large diffs with many findings. If the review produces zero findings, skip the API call entirely rather than posting an empty review, which creates noise in the PR timeline.
Handling the Token Secret
Never echo OPENAI_API_KEY to stdout in your script or workflow steps. GitHub Actions automatically redacts registered secrets from logs, but any string you explicitly print bypasses that protection. Keep the key in os.environ and pass it only to the OpenAI client constructor. If your organization rotates API keys on a schedule, store the secret at the organization level rather than per-repository so updates propagate automatically.
With the workflow committed to your default branch, every new PR triggers a review within seconds of the first push. Reviewers see inline comments before they even open the diff themselves.
That automation surface is powerful, but it introduces real cost and latency variables that grow non-linearly with team size and PR volume—which is exactly what the next section addresses.
Controlling Cost, Latency, and Review Quality at Scale
Shipping an AI code review tool to a handful of PRs is straightforward. Keeping it economical and accurate across hundreds of PRs per day requires deliberate engineering on three fronts: token management, caching, and outcome tracking.

Chunking Large Diffs Without Losing Context
GPT-4o’s 128k context window is generous, but monorepo PRs routinely produce diffs that exceed it—especially when migrations or generated files are touched. The naive approach of truncating diffs produces incoherent reviews. Instead, split diffs at file boundaries and group files by functional area: routes with their corresponding tests, schema files with migrations. Each chunk gets reviewed independently, then results are merged before posting to the PR.
The critical constraint is preserving enough surrounding context within each chunk. Strip purely additive whitespace and generated code (lock files, vendored assets) before chunking. A 5,000-token file-level budget per chunk, with 2,000 tokens reserved for system prompt and response, keeps you safely within limits while producing focused, actionable feedback.
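A sketch of that chunking strategy, assuming a plain unified diff: split at `diff --git ` boundaries, then pack files greedily into chunks under a character budget (roughly 4 characters per token, so 20,000 characters approximates a 5,000-token chunk):

```python
import re

def split_diff_by_file(diff: str) -> list[str]:
    # Split a unified diff into per-file sections at "diff --git" headers.
    parts = re.split(r"(?m)^(?=diff --git )", diff)
    return [p for p in parts if p.strip()]

def pack_chunks(file_diffs: list[str], budget_chars: int = 20_000) -> list[str]:
    # Greedily group per-file diffs; flush the current chunk whenever
    # adding the next file would exceed the budget.
    chunks, current, size = [], [], 0
    for fd in file_diffs:
        if current and size + len(fd) > budget_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(fd)
        size += len(fd)
    if current:
        chunks.append("".join(current))
    return chunks
```

Grouping related files (routes with tests, schemas with migrations) would slot in between these two steps as a sort key on the file list.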
Caching File-Level Reviews on Rebases
Rebases are the silent cost driver in high-velocity teams. A developer rebases onto main before merge, and the CI pipeline re-reviews every file—even ones that haven’t changed. The fix is straightforward: hash each file’s diff content before the API call, and store the resulting review in a short-lived cache keyed on that hash. Redis with a 24-hour TTL works well. On a rebase where only two files have new changes, you serve the other eight reviews from cache and issue two API calls instead of ten.
💡 Pro Tip: Include the model version and system prompt hash in your cache key. A prompt update should invalidate cached reviews automatically, not silently return stale feedback.
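The key derivation might look like this, hashing the model and system prompt alongside the diff content so that changing any one of them produces a new key:

```python
import hashlib

def review_cache_key(file_diff: str, model: str, system_prompt: str) -> str:
    # Hash the diff together with the model name and prompt text; the NUL
    # delimiter prevents boundary collisions between concatenated parts.
    h = hashlib.sha256()
    for part in (model, system_prompt, file_diff):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return f"ai-review:{h.hexdigest()}"
```

Store the serialized review under this key with a 24-hour TTL (`SETEX` in Redis). A prompt edit changes the key, so stale entries are never served and simply expire unused.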
Tracking Spend with tiktoken
Set a per-PR token budget before dispatching requests. Use tiktoken to count tokens in your assembled prompt client-side, abort and post a warning comment if the budget is exceeded, and log actual usage from each API response. Aggregate this in a time-series store and alert when a single PR exceeds 50,000 tokens—that’s a signal to tune your chunking logic, not simply accept the bill.
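A minimal budget check along those lines. It prefers an exact count via tiktoken and falls back to a rough 4-characters-per-token estimate if the library is unavailable (the estimate is an assumption, not an API guarantee):

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    # Exact count when tiktoken is installed and knows the model;
    # otherwise approximate at ~4 characters per token.
    try:
        import tiktoken
        return len(tiktoken.encoding_for_model(model).encode(text))
    except Exception:
        return max(1, len(text) // 4)

def within_budget(prompt: str, budget: int = 50_000) -> bool:
    # Gate the API call: when this returns False, post a warning
    # comment instead of dispatching the review request.
    return count_tokens(prompt) <= budget
```

The same `count_tokens` call feeds your usage logging, so estimated and actual token counts can be compared per PR.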
Measuring Review Signal Quality
Cost metrics tell you what you’re spending; outcome metrics tell you whether it’s worth it. Instrument your GitHub bot to track two events per comment: resolved (the author addressed it before merge) and dismissed (closed without action). A resolution rate below 40% on a comment category indicates prompt drift or a category that engineers don’t trust—either tighten the prompt or remove that check.
Expose these metrics in a dashboard alongside model latency per chunk. Review quality degrades subtly over time as codebases evolve, and you need visibility before your team starts ignoring the bot entirely.
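The resolution-rate computation itself is small enough to sketch; the `(category, outcome)` event shape here is an assumption about how your bot logs comment outcomes:

```python
from collections import Counter

def resolution_rates(events: list[tuple[str, str]]) -> dict[str, float]:
    # events: (category, outcome) pairs, outcome in {"resolved", "dismissed"}.
    resolved, total = Counter(), Counter()
    for category, outcome in events:
        total[category] += 1
        if outcome == "resolved":
            resolved[category] += 1
    return {c: resolved[c] / total[c] for c in total}

def low_signal_categories(rates: dict[str, float], threshold: float = 0.4) -> list[str]:
    # Categories below the threshold are candidates for prompt tightening
    # or removal from the check set.
    return sorted(c for c, r in rates.items() if r < threshold)
```

Run this nightly over the accumulated events and surface `low_signal_categories` on the same dashboard as latency and spend.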
With cost and quality instrumented, the harder question becomes what the tool fundamentally cannot do—and how to communicate those limits to your team.
What This Doesn’t Replace and Where to Go Next
The tool you’ve built is a force multiplier for review throughput, not a replacement for engineering judgment. Understanding that boundary is what separates a useful deployment from a frustrating one.
What Still Requires a Human
GPT-4o excels at pattern recognition within a diff. It does not understand your team’s architectural history, the migration you decided against six months ago, or why a particular abstraction was intentionally kept leaky for operational reasons. Architectural review—decisions about service boundaries, data ownership, and long-term coupling—remains a human responsibility. The same applies to enforcing unwritten team conventions that live in Slack threads and post-mortems rather than linting rules.
Security review is a related case. The model flags obvious anti-patterns reliably, but threat modeling requires knowledge of your specific attack surface, trust boundaries, and regulatory context that no amount of prompt engineering encodes reliably.
The Three Highest-ROI Extensions
Organization-specific fine-tuning. Once you have six months of reviewer feedback captured in your feedback loop, you have a training dataset. Fine-tuning on accepted and rejected review comments produces a model that understands your conventions without relying on a growing system prompt that erodes context quality.
Multimodal input for UI review. GPT-4o’s vision capability lets you pass screenshot diffs or Figma exports alongside the code diff. A reviewer comment that references both the rendered output and the React component is substantially more actionable than text analysis alone.
Closing the feedback loop. Instrument your GitHub integration to record which AI comments engineers resolve versus dismiss. Route that signal back into a nightly prompt evaluation job. Prompt quality degrades silently without measurement; explicit feedback capture is what keeps the tool improving after launch.
💡 Pro Tip: Prioritize the feedback loop instrumentation before fine-tuning. You need the data before you can use it, and the logging infrastructure pays dividends immediately by surfacing which comment categories have the lowest acceptance rates.
The patterns in this guide give you a working foundation. The extensions above are where the tool evolves from a novelty into infrastructure your team relies on.
Key Takeaways
- Use GPT-4o’s structured output mode with a JSON schema to get typed, machine-parseable review comments instead of free-form text—this makes downstream routing and deduplication trivial.
- Design your prompt around a severity taxonomy (critical, warning, suggestion) and filter on it in CI to control which findings block merges versus surface as advisory notes.
- Cache file-level review results keyed on content hash to eliminate redundant API calls on rebases, and track token usage per PR from the start to avoid runaway costs as adoption grows.
- Measure review signal quality by recording which AI comments get resolved versus dismissed—use this data to iteratively tighten your prompts and reduce noise over time.