Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets
Problem Statement
LLMs excel at static code analysis but fundamentally struggle to reason about dynamic runtime behavior, state changes, and execution traces. This gap means they often fail to predict runtime errors before code is actually executed. Early detection of such errors in online code snippets is critical to prevent expensive downstream fixes when snippets are integrated into larger codebases.
Key Novelty
- Autonomous CFG-guided planning: LLM is instructed to formulate and follow a plan for navigating a Control Flow Graph, effectively acting as a predictive interpreter for (in)complete code snippets.
- Symbol-table-aware branching: The LLM pauses at control flow branch points to inspect variable state via symbol tables, minimizing cascading error propagation during simulated execution.
- Single-prompt predictive execution: The entire predictive interpretation is accomplished within one LLM prompt (no step-by-step re-prompting), making the approach cost-efficient and practical at scale.
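The summary does not reproduce Orca's actual prompt; as a minimal sketch of how the three ideas above might be folded into one prompt (function name, wording, and structure are illustrative assumptions, not the authors' prompt), consider:

```python
def build_predictive_execution_prompt(snippet: str, cfg_description: str) -> str:
    """Assemble one prompt asking the LLM to plan a CFG traversal,
    track a symbol table at branch points, and emit the full trace
    plus predicted runtime errors in a single response.

    Illustrative only: Orca's real prompt is not published in this summary.
    """
    return (
        "You are a predictive interpreter.\n"
        "1. Formulate a plan to traverse the control flow graph below.\n"
        "2. At every branch point, pause and write out the symbol table\n"
        "   (each variable and its current value) before choosing a path.\n"
        "3. Do not stop between steps; produce the full execution trace\n"
        "   and any predicted runtime errors in this single response.\n\n"
        f"Control flow graph:\n{cfg_description}\n\n"
        f"Code snippet:\n{snippet}\n"
    )

# Hypothetical usage with a trivial one-statement CFG description.
prompt = build_predictive_execution_prompt("x = 1 / y", "entry -> stmt_1 -> exit")
```

The point of the single-response instruction (step 3) is the cost claim above: one prompt replaces a re-prompt per execution step.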
Evaluation Highlights
- Orca improves over state-of-the-art approaches in predicting execution traces of code snippets, demonstrating better dynamic reasoning capability.
- Orca outperforms existing methods in static detection of runtime errors in online code snippets, validating its effectiveness on a downstream software engineering task.
Methodology
- Step 1 – CFG Construction: Parse the input (possibly incomplete) code snippet and construct a Control Flow Graph representing all possible execution paths.
- Step 2 – Predictive Execution Planning: Instruct the LLM via a single prompt to autonomously generate a plan for traversing the CFG, tracking variable states in symbol tables at each node, and pausing at branch points to evaluate conditions.
- Step 3 – Runtime Error Detection: Use the simulated execution trace produced by the LLM's plan to statically identify potential runtime errors (e.g., null dereferences, out-of-bounds accesses, type errors) before actual execution.
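To make Step 1 concrete, here is a minimal sketch of surfacing the branch points a CFG would encode, using Python's standard `ast` module (a stand-in for the paper's CFG builder, whose implementation is not described in this summary; a full builder would also add nodes for straight-line statements and edges for every possible successor):

```python
import ast

def branch_points(source: str):
    """Collect control-flow branch points (if/while/for) in a snippet.

    These are exactly the nodes where the predictive interpreter must
    pause and consult the symbol table before choosing a path.
    """
    tree = ast.parse(source)
    points = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.While, ast.For)):
            points.append((type(node).__name__, node.lineno))
    return points

# A snippet with one branch: the guard on line 2 is where execution forks.
snippet = """\
items = data.get("items")
if items is None:
    total = 0
else:
    total = sum(items)
"""
print(branch_points(snippet))  # [('If', 2)]
```

Note that parsing succeeds even though `data` is undefined in the snippet, which is why a CFG can be built for incomplete code that could not actually run.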
System Components
- CFG Builder: Parses code snippets (including incomplete ones) and constructs a CFG that captures all possible execution paths and branching logic.
- Predictive Execution Planner: The core LLM module that receives the CFG and a single structured prompt instructing it to formulate a traversal plan, simulating a step-by-step interpreter.
- Symbol Table Tracker: Maintains and updates variable value states at each CFG node during the LLM's simulated execution, particularly at branch points, to ensure accurate path decisions.
- Runtime Error Detector: Analyzes the predicted execution trace output by the planner to statically flag runtime errors and defects without running the code.
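As a toy illustration of the symbol-table idea (the class and method names below are assumptions for this sketch, not the paper's API), the tracker can be modeled as a per-node snapshot of variable bindings consulted at each branch:

```python
class SymbolTableTracker:
    """Records variable bindings at CFG nodes during simulated execution.

    Snapshotting at branch points lets a checker confirm that the chosen
    path (e.g. which side of an `if` was taken) is consistent with the
    tracked values, so one mistaken step does not silently cascade.
    """
    def __init__(self):
        self.bindings = {}
        self.snapshots = {}  # node id -> copy of bindings at that node

    def assign(self, name, value):
        self.bindings[name] = value

    def snapshot(self, node_id):
        self.snapshots[node_id] = dict(self.bindings)

    def evaluate_branch(self, node_id, condition):
        """Record a snapshot, then decide the branch from tracked state."""
        self.snapshot(node_id)
        return condition(self.bindings)

# Simulate reaching `if items is None:` with items bound to None.
tracker = SymbolTableTracker()
tracker.assign("items", None)
taken = tracker.evaluate_branch("if_1", lambda env: env["items"] is None)
print(taken)                      # True: the None-guard branch is taken
print(tracker.snapshots["if_1"])  # {'items': None}
```

In Orca the LLM itself maintains this state inside its single response; the explicit data structure here only makes the design choice inspectable.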
Results
| Task | State-of-the-Art Baseline | Orca (This Paper) | Delta |
|---|---|---|---|
| Execution Trace Prediction | Lower accuracy (prior LLM/static methods) | Higher accuracy with CFG-guided planning | Improved (quantitative details in full paper) |
| Static Runtime Error Detection | Lower F1/precision/recall (existing tools) | Higher detection rate with predictive execution | Improved (quantitative details in full paper) |
| Prompt Cost Efficiency | Multi-step / multi-prompt approaches | Single prompt for full predictive interpretation | Significant cost reduction |
Key Takeaways
- Structured planning over CFGs allows LLMs to reason about dynamic program behavior without actual execution—a practical technique applicable to code review, linting, and CI/CD pipelines.
- Tracking symbol tables at branch points is a key design choice to prevent LLM reasoning errors from compounding; ML practitioners building code agents should incorporate explicit state management at control flow boundaries.
- Achieving full predictive execution within a single LLM prompt is a critical engineering insight for cost-sensitive deployments—designing prompts that encode a plan rather than iterating step-by-step can dramatically reduce inference costs.
Abstract
Large Language Models (LLMs) excel at generating and reasoning about source code and natural-language text. They can recognize patterns, syntax, and semantics in code, making them effective in several software engineering tasks. However, they exhibit weaknesses in reasoning about program execution: they primarily operate on static code representations and fail to capture the dynamic behavior and state changes that occur during program execution. In this paper, we advance the capabilities of LLMs in reasoning about dynamic program behavior. We propose Orca, a novel approach that instructs an LLM to autonomously formulate a plan to navigate a control flow graph (CFG) for predictive execution of (in)complete code snippets, acting as a predictive interpreter that "executes" the code. In Orca, we guide the LLM to pause at branching points and focus on the state of the symbol tables holding variables' values, thus minimizing error propagation in the LLM's computation. We instruct the LLM not to stop at each step of its execution plan, so the entire predictive interpretation requires only one prompt, which makes the approach substantially more cost-efficient. As a downstream task, we use Orca to statically identify runtime errors in online code snippets. Early detection of runtime errors and defects in these snippets is crucial to prevent costly fixes later in the development cycle, after the snippets have been adapted into a codebase. Our empirical evaluation showed that Orca is effective and improves over state-of-the-art approaches in predicting execution traces and in statically detecting runtime errors.