
Fix Critical Bugs with Outside-Diff Impact Slicing

Editor’s note: David Loker is a speaker for ODSC AI West this October 28th-30th. Check out his talk, Context Engineering for AI Code Reviews with MCP, LLMs, and Open-Source DevOps Tooling, there!

TL;DR: Your AI can generate a React component in seconds, but ask it to find the bug in a 30-line PR and it hallucinates issues that don’t exist. The problem isn’t the model—it’s the context, or the lack thereof. This post shares a compact technique called Outside-Diff Impact Slicing that looks beyond the patch to catch bugs at caller/callee boundaries. You’ll run one Python script using OpenAI’s Responses API with GPT-5-mini and get structured, evidence-backed findings ready to paste into a PR.

Note: This works best for focused PRs (10-50 changed lines). For larger changes, see “Try this next” at the end.


The real problem: diffs hide the contracts

Here’s the thing about code review: the diff view lies to you. It shows what changed, but not what those changes might break. For example, when you add a parameter to a function, the diff won’t show you the twelve call sites that are now passing the wrong number of arguments. Or when you change a return type, the diff won’t highlight the upstream code expecting the old format.

Most AI code review tools make the same mistake: they send the LLM a patch and ask it to “find bugs.” But the most critical bugs aren’t in the patch. They’re at the boundaries between changed code and unchanged code. That’s where contracts get violated.

Outside-Diff Impact Slicing fixes this by asking a simple question: “What’s one hop away from this change?” Specifically:

  • Callers: What code calls the functions/classes I just modified?
  • Callees: What functions/classes does my changed code call?

These boundaries are where the interesting and critical bugs live, especially those with the highest potential to cause downtime. One important refinement: extract calls from the changed lines themselves, not from the entire changed file. If line 55 calls DatabaseConnection, you care about that contract. You don’t care about the unrelated validate_input call on line 200.

The technique: six focused steps

The full script is ~400 lines (available on GitHub), but the core technique breaks into six pieces. I’ll show you the interesting parts.

Step 1: Parse the diff for exact line numbers

This step needs surgical precision. Instead of simply saying “this file changed,” you want granularity like “lines 55-57 in reporting/recreate.py changed.”

# Shared imports for the snippets in this post
import ast
import pathlib
import subprocess
from typing import Dict, Set

def changed_lines(repo=".") -> Dict[str, Set[int]]:
    """Extract changed line numbers from the git diff (HEAD~1 vs HEAD)."""
    diff = subprocess.check_output(
        ["git", "-C", repo, "diff", "--unified=0", "--no-color", "HEAD~1"]
    ).decode()
    current = None
    changes: Dict[str, Set[int]] = {}
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current = line[6:]  # Extract filename
        elif line.startswith("@@") and current:
            # Parse hunk header: @@ -10,3 +27,8 @@
            # We want the "+27,8" part (new file line numbers)
            parts = [p for p in line.split() if p.startswith("+")]
            if not parts:
                continue
            hunk = parts[0]  # "+27,8" or "+42"
            start = int(hunk.split(",")[0][1:])
            count = int(hunk.split(",")[1]) if "," in hunk else 1
            changes.setdefault(current, set()).update(range(start, start + count))
    return changes

Why it matters: Line-level granularity lets you focus your analysis. If only line 55 changed, you don’t care about the function call on line 200.
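
As a quick sanity check, here is what a run might print; the path and line numbers below are the hypothetical example from earlier, not output from a real repo:

changes = changed_lines(".")
print(changes)
# e.g. {'reporting/recreate.py': {55, 56, 57}} -> only these lines matter downstream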

Step 2: Extract calls from changed lines only

Focus matters. Instead of grabbing all function calls in a changed file, only extract calls from the specific lines that changed.

def calls_in_lines(path: str, lines: Set[int]) -> Set[str]:
    """Extract function/class calls within specific line numbers."""
    src = pathlib.Path(path).read_text(encoding="utf-8")
    tree = ast.parse(src)

    calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only grab calls that occur on changed lines
            if hasattr(node, 'lineno') and node.lineno in lines:
                calls.add(node.func.id)
    return calls

Why it matters: If a file has 200 lines with 50 function calls, but only 3 lines changed with 1 call, you analyze 1 contract instead of 50: less noise, more signal.
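
Wiring the first two steps together is a few lines; here is a minimal sketch (it skips non-Python files, since the ast module only parses Python):

changes = changed_lines(".")
for path, lines in changes.items():
    if not path.endswith(".py"):
        continue  # ast only handles Python sources
    print(path, sorted(lines), "->", sorted(calls_in_lines(path, lines)))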

Step 3: Identify signature changes for caller analysis

For the caller direction, you only care about functions/classes whose signatures changed—not every function that happens to contain a changed line.

def symbols_with_signature_changes(path: str, lines: Set[int]) -> Set[str]:
    """Find functions/classes whose SIGNATURES were changed (def line itself)."""
    src = pathlib.Path(path).read_text(encoding="utf-8")
    tree = ast.parse(src)
    changed_signatures = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Check if the definition line itself was changed
            if node.lineno in lines:
                changed_signatures.add(node.name)
            # For classes, also check if __init__ signature changed
            if isinstance(node, ast.ClassDef):
                for item in node.body:
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)) and item.name == "__init__":
                        if item.lineno in lines:
                            changed_signatures.add(node.name)
    return changed_signatures

Why this matters: If you change line 100 inside process_data() (not the def line), you don’t need to check all callers of process_data(), because the function signature didn’t change. But if you change line 18 from def validate_email(email): to def validate_email(email, strict=True):, you DO need to check callers.
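
To see the distinction in code, using the validate_email example from later in this post (the second line number is hypothetical, standing in for a change deep inside the function body):

# Line 18 is the def line of validate_email -> its callers need checking
print(symbols_with_signature_changes("src/utils/validation.py", {18}))  # {'validate_email'}
# A body-only change -> no signature change, so no caller check
print(symbols_with_signature_changes("src/utils/validation.py", {25}))  # set()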

Step 4: Build the one-hop slice

Now we can use a call graph to find the impact files in both directions:

  • Callees: Files defining what your changed lines call (from calls_in_lines)
  • Callers: Files calling functions whose signatures you changed (from symbols_with_signature_changes)

The full implementation builds a simple call graph (callgraph_for_files) tracking which files define/call which symbols, then uses it to find impact files. See GitHub for the complete one_hop_slice() function.
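
As a rough sketch of the idea (not the actual GitHub implementation), assume callgraph_for_files gives you two dicts: defines (symbol -> files that define it) and calls (symbol -> files that call it). The slice is then:

def one_hop_slice(changes, defines, calls):
    """Sketch only: collect one-hop impact files in both directions."""
    impact = {"callees": set(), "callers": set()}
    for path, lines in changes.items():
        # Callees: files defining the symbols that the changed lines call
        for symbol in calls_in_lines(path, lines):
            impact["callees"] |= defines.get(symbol, set()) - {path}
        # Callers: files calling the symbols whose signatures changed
        for symbol in symbols_with_signature_changes(path, lines):
            impact["callers"] |= calls.get(symbol, set()) - {path}
    return impact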


Step 5: Structured markdown format with XML-style tags

Here’s a surprise: markdown beats JSON for LLM input. It’s clearer, more token-efficient, and easier for the model to parse.

The context is structured into three sections:

  1. Git Diff: wrapped in XML-style tags showing what changed
  2. Changed Code: each changed file in its own tag, marked type="changed", with code snippets
  3. Impact Code: split into callees and callers subsections, each file tagged with type="impact"

Why this format works: The XML-style tags let the LLM clearly distinguish “changed code” from “reference contracts.” The type="changed" vs type="impact" distinction is critical for preventing hallucinations where the model cites the wrong file. Markdown with code blocks is also more token-efficient than nested JSON structures.
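
As a rough illustration of the assembly (the tag names and variable names below are placeholders; the script’s exact format may differ), the context can be built with plain string formatting:

# diff_text, changed_snippets, callee_snippets, caller_snippets are placeholders
# for the data gathered in the earlier steps (diff text plus path -> snippet dicts).
review_context = "\n".join(
    ["# Git Diff", "<diff>", diff_text, "</diff>", "", "# Changed Code"]
    + [f'<file path="{p}" type="changed">\n{snip}\n</file>'
       for p, snip in changed_snippets.items()]
    + ["", "# Impact Code", "## Callees"]
    + [f'<file path="{p}" type="impact">\n{snip}\n</file>'
       for p, snip in callee_snippets.items()]
    + ["## Callers"]
    + [f'<file path="{p}" type="impact">\n{snip}\n</file>'
       for p, snip in caller_snippets.items()]
)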

Step 6: The prompt that makes it work

The prompt has one critical job: make it crystal clear that findings should reference changed files, not impact files. Early versions of our technique kept citing the impact file (the contract definition) instead of the buggy changed code. Here’s the prompt that solved it:

prompt = (
    "You are a senior code reviewer analyzing a PR for bugs. "
    "You will receive structured markdown with THREE sections:\n\n"
    "1. Git Diff: Shows what changed (in XML-style tags)\n"
    "2. Changed Code: Snippets from modified files (type=\"changed\")\n"
    "3. Impact Code: Both CALLEES (definitions the changed code calls) and CALLERS "
    "(code that calls the changed symbols). These show contracts/signatures and usage patterns.\n\n"
    "YOUR TASK: Find real bugs in the CHANGED CODE. Look for:\n"
    "- CONTRACT MISMATCHES: Wrong parameter count, signature changes\n"
    "- LOGIC ERRORS: Off-by-one, incorrect conditionals, missing edge cases\n"
    "- CONCURRENCY: Race conditions, missing synchronization\n"
    "- RESOURCE MANAGEMENT: Leaks, missing cleanup\n"
    "- ERROR HANDLING: Unhandled exceptions, silent failures\n"
    "- SECURITY: Injection risks, missing validation\n\n"
    "CRITICAL: Your findings MUST reference the CHANGED files (type=\"changed\"), "
    "NOT the impact files. Impact files show contracts for reference only.\n\n"
    "Focus on real bugs, not style. If nothing critical, return empty bugs array.\n\n"
    + review_context
)

Why this works:

  1. The “CRITICAL” instruction: Explicitly stating that findings must reference changed files (not impact files) cut wrong-file citations from roughly 40% to nearly zero. Without this instruction, the model naturally gravitates toward citing the contracts it sees in the impact section.
  2. Concrete bug categories: Listing specific types (contract-mismatch, logic-error, etc.) guides the model toward real issues rather than style complaints or vague “could be better” suggestions.
  3. Three-section structure: By clearly labeling the diff, changed code, and impact code with XML-style tags, the model can easily distinguish “what changed” from “what the changes interact with,” shifting the review’s focus from the change itself to its impact.

Implementation note: The full script uses OpenAI’s Responses API with GPT-5-mini and structured outputs to guarantee JSON schema compliance. This ensures you get consistent, parseable results every time. See the full code on GitHub for API details.
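
A minimal sketch of that call, with a schema trimmed to the fields shown in the example findings below (the script’s actual schema and call details may differ):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Trimmed schema: one finding object, mirroring the fields in the examples below
finding = {
    "type": "object",
    "properties": {
        "changed_file": {"type": "string"},
        "changed_lines": {"type": "string"},
        "bug_category": {"type": "string"},
        "summary": {"type": "string"},
        "comment": {"type": "string"},
        "diff_fix_suggestion": {"type": "string"},
    },
    "required": ["changed_file", "changed_lines", "bug_category",
                 "summary", "comment", "diff_fix_suggestion"],
    "additionalProperties": False,
}
schema = {
    "type": "object",
    "properties": {"bugs": {"type": "array", "items": finding}},
    "required": ["bugs"],
    "additionalProperties": False,
}

response = client.responses.create(
    model="gpt-5-mini",
    input=prompt,
    text={"format": {"type": "json_schema", "name": "review_findings",
                     "schema": schema, "strict": True}},
)
findings = json.loads(response.output_text)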

Does it actually work? A real example

I tested this on a PR where someone refactored a database connection helper. Here’s what the script found:

The changed code (src/workers/data_sync.py, line 73):

# Refactored to use new connection pooling
conn = DatabaseConnection(config["db_host"], config["db_port"], timeout=30)

The impact code (src/db/connection.py, the contract):

class DatabaseConnection:
    def __init__(self, connection_string: str, pool_size: int = 10):
        """Initialize connection from a connection string like 'host:port'."""
        self.connection_string = connection_string
        self.pool_size = pool_size
        # ...

The finding:

{
  "changed_file": "src/workers/data_sync.py",
  "changed_lines": "73",
  "bug_category": "contract-mismatch",
  "summary": "DatabaseConnection called with wrong parameter types and count",
  "comment": "The changed code calls DatabaseConnection(config['db_host'], config['db_port'], timeout=30) with three arguments (two positional strings and a keyword arg). The impact code shows DatabaseConnection.__init__ expects a single connection_string parameter (format 'host:port') and an optional pool_size integer. This will raise TypeError at runtime. The 'timeout' parameter doesn't exist in the signature.",
  "diff_fix_suggestion": "--- a/src/workers/data_sync.py\n+++ b/src/workers/data_sync.py\n@@ -73,1 +73,1 @@\n-conn = DatabaseConnection(config['db_host'], config['db_port'], timeout=30)\n+conn = DatabaseConnection(f\"{config['db_host']}:{config['db_port']}\")"
}

Why a human might miss this: The developer saw “DatabaseConnection” in the old code, knew it changed, but didn’t look up the new signature in src/db/connection.py. When reviewing the diff, you see what looks like reasonable arguments (host, port, timeout) and your brain doesn’t flag it. The contract violation is invisible until you cross-reference the actual definition, which is exactly what Outside-Diff Impact Slicing automates.

The technique also works in the other direction (finding bugs in callers when you change a function signature):

The changed code (src/utils/validation.py, line 18):

def validate_email(email: str, domain_whitelist: List[str]):
    """Validate email format. Now requires a domain whitelist for security."""
    # ... implementation checks if email domain is in whitelist

The impact code (caller in src/api/auth.py, line 95):

# This caller wasn't updated when validate_email signature changed
if validate_email(user_input):
    send_confirmation(user_input)

The finding:

{
  "changed_file": "src/utils/validation.py",
  "changed_lines": "18",
  "bug_category": "contract-mismatch",
  "summary": "validate_email signature changed to require domain_whitelist but caller missing it",
  "comment": "The changed code modified validate_email to require a second parameter 'domain_whitelist' (a required List[str]). However, the impact code shows a caller in src/api/auth.py:95 that only passes one argument: validate_email(user_input). This will raise TypeError at runtime: validate_email() missing 1 required positional argument: 'domain_whitelist'.",
  "diff_fix_suggestion": "--- a/src/api/auth.py\n+++ b/src/api/auth.py\n@@ -95,1 +95,1 @@\n-if validate_email(user_input):\n+if validate_email(user_input, ALLOWED_EMAIL_DOMAINS):"
}

This demonstrates both directions: callees (what changed code calls) and callers (what calls the changed code).

Why this technique works

Three ingredients make this effective:

  1. Graph awareness beyond the diff: instead of just reading the patch, you check contracts at the boundaries. That’s where critical integration bugs live.
  2. Line-level precision: extract calls only from changed lines, not entire files. Simple refinement, significant noise reduction.
  3. Structured input + explicit constraints: the markdown format with XML tags gives the LLM clear structure. The “CRITICAL” instruction about changed vs. impact files prevents the most common hallucination (citing the wrong file).

Limitations: This works best for small, focused PRs (10-50 changed lines). Larger PRs blow out the context window; they need token budgeting to rank snippets by relevance and clip low-priority code.


Try this next

Once you have the basic technique working, here are extensions worth exploring:

  1. Richer code graphs with Tree-sitter: the script uses Python’s ast module, which only works for Python. Swap in Tree-sitter to handle JavaScript, TypeScript, Go, Rust or any language with a grammar. You’ll get more accurate definitions and cross-file references.
  2. Add linter output to the context: run ruff, mypy, or eslint on your impact files and pipe their findings (ideally as JSON or SARIF) into the review context. Static analysis catches different bugs than LLMs—combine them.
  3. Token budgeting for large PRs: for PRs with 200+ changed lines, you’ll blow your context window. Build a ranking system: score snippets by (distance from changed lines × call frequency × past bug density), then keep only the top N. CodeRabbit does this and it is the difference between “works on toy PRs” and “works in production.”
  4. MCP integration for org-specific context: if you’re using Model Context Protocol servers, you can fetch ticket descriptions, CI logs, feature requirements docs, architectural diagrams, or internal style guides and append them to your review context. The LLM can then check “does this change actually fix JIRA-1234?” or “does this follow our error handling conventions?”
  5. Post findings as PR comments: use the GitHub API (gh pr comment) or your platform’s API to post findings directly to the PR (a minimal sketch follows this list). Include the changed_lines in your API call to anchor comments inline. Now your review bot feels like a real teammate that finds bugs you would otherwise miss.
  6. Measure and iterate: track precision (what % of findings are real bugs?) and recall (what % of real bugs did it find?) over time. If a category has high false positives, tighten the prompt. If it misses obvious bugs, add targeted checks.
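
For item 5, here is a minimal sketch that drives the GitHub CLI from Python. It assumes gh is installed and authenticated, and that findings is the parsed JSON the script outputs; posting inline comments anchored to changed_lines requires the REST API instead, which this sketch skips:

import subprocess

def post_findings(pr_number: int, findings: dict) -> None:
    """Post each finding as a top-level PR comment via the GitHub CLI."""
    for bug in findings.get("bugs", []):
        body = (f"{bug['bug_category']} in {bug['changed_file']} "
                f"(lines {bug['changed_lines']}): {bug['comment']}")
        subprocess.run(["gh", "pr", "comment", str(pr_number), "--body", body],
                       check=True)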

Run it yourself

Prerequisites: Python 3.10+, pip install openai, at least one commit to diff against.

Full script: github.com/coderabbitai/odsc-west-2025/review_demo.py (includes helpers omitted here for brevity)

Quick start:

git clone 
cd your-project-with-a-diff
python /path/to/review_demo.py
# Paste your OpenAI API key when prompted

The script outputs JSON findings you can paste into a PR comment or pipe to another tool.

See it live at ODSC West 2025

I’ll be presenting “Context Engineering for AI Code Reviews with MCP, LLMs, and Open-Source DevOps Tooling” at ODSC AI West. The talk covers the full system: graph awareness, multi-linter evidence, repo history, agent guidelines, custom rules, and MCP integration for org-specific context. Hope to see you there!

About the author

David Loker is the Director of AI at CodeRabbit, where he leads development of agentic AI systems for code review and developer workflows. He has published at NeurIPS, ICML, and AAAI, and has been building large-scale AI systems since 2007.


