Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets
Problem Statement
LLMs excel at static code analysis but fundamentally struggle to reason about dynamic runtime behavior, state changes, and execution traces. This gap means they often fail to predict runtime errors before code is actually executed. Early detection of such errors in online code snippets is critical to prevent expensive downstream fixes when snippets are integrated into larger codebases.
Key Novelty
- Autonomous CFG-guided planning: LLM is instructed to formulate and follow a plan for navigating a Control Flow Graph, effectively acting as a predictive interpreter for (in)complete code snippets.
- Symbol-table-aware branching: The LLM pauses at control flow branch points to inspect variable state via symbol tables, minimizing cascading error propagation during simulated execution.
- Single-prompt predictive execution: The entire predictive interpretation is accomplished within one LLM prompt (no step-by-step re-prompting), making the approach cost-efficient and practical at scale.
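The summary does not reproduce Orca's actual prompt; as a minimal sketch of how the three ideas above might be folded into one prompt (function name, wording, and structure are illustrative assumptions, not the authors' prompt), consider:

```python
def build_predictive_execution_prompt(snippet: str, cfg_description: str) -> str:
    """Assemble one prompt asking the LLM to plan a CFG traversal,
    track a symbol table at branch points, and emit the full trace
    plus predicted runtime errors in a single response.

    Illustrative only: Orca's real prompt is not published in this summary.
    """
    return (
        "You are a predictive interpreter.\n"
        "1. Formulate a plan to traverse the control flow graph below.\n"
        "2. At every branch point, pause and write out the symbol table\n"
        "   (each variable and its current value) before choosing a path.\n"
        "3. Do not stop between steps; produce the full execution trace\n"
        "   and any predicted runtime errors in this single response.\n\n"
        f"Control flow graph:\n{cfg_description}\n\n"
        f"Code snippet:\n{snippet}\n"
    )

# Hypothetical usage with a trivial one-statement CFG description.
prompt = build_predictive_execution_prompt("x = 1 / y", "entry -> stmt_1 -> exit")
```

The point of the single-response instruction (step 3) is the cost claim above: one prompt replaces a re-prompt per execution step.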
Evaluation Highlights
- Orca improves over state-of-the-art approaches in predicting execution traces of code snippets, demonstrating better dynamic reasoning capability.
- Orca outperforms existing methods in static detection of runtime errors in online code snippets, validating its effectiveness on a downstream software engineering task.
Methodology
- Step 1 – CFG Construction: Parse the input (possibly incomplete) code snippet and construct a Control Flow Graph representing all possible execution paths.
- Step 2 – Predictive Execution Planning: Instruct the LLM via a single prompt to autonomously generate a plan for traversing the CFG, tracking variable states in symbol tables at each node, and pausing at branch points to evaluate conditions.
- Step 3 – Runtime Error Detection: Use the simulated execution trace produced by the LLM's plan to statically identify potential runtime errors (e.g., null dereferences, out-of-bounds accesses, type errors) before actual execution.
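To make Step 1 concrete, here is a minimal sketch of surfacing the branch points a CFG would encode, using Python's standard `ast` module (a stand-in for the paper's CFG builder, whose implementation is not described in this summary; a full builder would also add nodes for straight-line statements and edges for every possible successor):

```python
import ast

def branch_points(source: str):
    """Collect control-flow branch points (if/while/for) in a snippet.

    These are exactly the nodes where the predictive interpreter must
    pause and consult the symbol table before choosing a path.
    """
    tree = ast.parse(source)
    points = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.While, ast.For)):
            points.append((type(node).__name__, node.lineno))
    return points

# A snippet with one branch: the guard on line 2 is where execution forks.
snippet = """\
items = data.get("items")
if items is None:
    total = 0
else:
    total = sum(items)
"""
print(branch_points(snippet))  # [('If', 2)]
```

Note that parsing succeeds even though `data` is undefined in the snippet, which is why a CFG can be built for incomplete code that could not actually run.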
System Components
- CFG Builder: Parses code snippets (including incomplete ones) and constructs a CFG that captures all possible execution paths and branching logic.
- Predictive Execution Planner: The core LLM module that receives the CFG and a single structured prompt instructing it to formulate a traversal plan, simulating a step-by-step interpreter.
- Symbol Table Tracker: Maintains and updates variable value states at each CFG node during the LLM's simulated execution, particularly at branch points, to ensure accurate path decisions.
- Runtime Error Detector: Analyzes the predicted execution trace output by the planner to statically flag runtime errors and defects without running the code.
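As a toy illustration of the symbol-table idea (the class and method names below are assumptions for this sketch, not the paper's API), the tracker can be modeled as a per-node snapshot of variable bindings consulted at each branch:

```python
class SymbolTableTracker:
    """Records variable bindings at CFG nodes during simulated execution.

    Snapshotting at branch points lets a checker confirm that the chosen
    path (e.g. which side of an `if` was taken) is consistent with the
    tracked values, so one mistaken step does not silently cascade.
    """
    def __init__(self):
        self.bindings = {}
        self.snapshots = {}  # node id -> copy of bindings at that node

    def assign(self, name, value):
        self.bindings[name] = value

    def snapshot(self, node_id):
        self.snapshots[node_id] = dict(self.bindings)

    def evaluate_branch(self, node_id, condition):
        """Record a snapshot, then decide the branch from tracked state."""
        self.snapshot(node_id)
        return condition(self.bindings)

# Simulate reaching `if items is None:` with items bound to None.
tracker = SymbolTableTracker()
tracker.assign("items", None)
taken = tracker.evaluate_branch("if_1", lambda env: env["items"] is None)
print(taken)                      # True: the None-guard branch is taken
print(tracker.snapshots["if_1"])  # {'items': None}
```

In Orca the LLM itself maintains this state inside its single response; the explicit data structure here only makes the design choice inspectable.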
Results
| Task | State-of-the-Art Baseline | Orca (This Paper) | Delta |
|---|---|---|---|
| Execution Trace Prediction | Lower accuracy (prior LLM/static methods) | Higher accuracy with CFG-guided planning | Improved (quantitative details in full paper) |
| Static Runtime Error Detection | Lower F1/precision/recall (existing tools) | Higher detection rate with predictive execution | Improved (quantitative details in full paper) |
| Prompt Cost Efficiency | Multi-step / multi-prompt approaches | Single prompt for full predictive interpretation | Significant cost reduction |
Key Takeaways
- Structured planning over CFGs allows LLMs to reason about dynamic program behavior without actual execution—a practical technique applicable to code review, linting, and CI/CD pipelines.
- Tracking symbol tables at branch points is a key design choice to prevent LLM reasoning errors from compounding; ML practitioners building code agents should incorporate explicit state management at control flow boundaries.
- Achieving full predictive execution within a single LLM prompt is a critical engineering insight for cost-sensitive deployments—designing prompts that encode a plan rather than iterating step-by-step can dramatically reduce inference costs.
Abstract
Large Language Models (LLMs) excel at generating and reasoning about source code and natural-language text. They can recognize patterns, syntax, and semantics in code, making them effective in several software engineering tasks. However, they exhibit weaknesses in reasoning about program execution: they primarily operate on static code representations and fail to capture the dynamic behavior and state changes that occur during program execution. In this paper, we advance the capabilities of LLMs in reasoning about dynamic program behavior. We propose Orca, a novel approach that instructs an LLM to autonomously formulate a plan to navigate a control flow graph (CFG) for predictive execution of (in)complete code snippets, acting as a predictive interpreter that "executes" the code. In Orca, we guide the LLM to pause at branching points and focus on the state of the symbol tables holding variables' values, thus minimizing error propagation in the LLM's computation. We instruct the LLM not to stop at each step of its execution plan, so the entire predictive interpretation requires only one prompt, which makes the approach substantially more cost-efficient. As a downstream task, we use Orca to statically identify runtime errors in online code snippets. Early detection of runtime errors and defects in these snippets is crucial to prevent costly fixes later in the development cycle, after the snippets have been adapted into a codebase. Our empirical evaluation showed that Orca is effective and improves over state-of-the-art approaches in predicting execution traces and in statically detecting runtime errors.