Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning

Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, J. Hyung, Jaegul Choo
2026
The performance gains from self-generated few-shot examples in LLM reasoning stem from the act of example creation itself, not the examples produced, meaning the generative process—not the artifacts—is the key driver of improvement.

Problem Statement

While self-generated few-shot prompting has shown promise for improving LLM reasoning, the mechanism behind these gains is poorly understood, making it difficult to design or apply the technique effectively. Prior work conflated two distinct factors—the process of generation and the reuse of generated examples—leaving practitioners without clear guidance. This ambiguity has led to suboptimal prompting strategies and missed opportunities for efficiency.

Key Novelty

  • Introduces a principled three-way comparison framework (Zero-shot, Integrated, and Decoupled prompting) to isolate whether performance gains come from example creation or example reuse
  • Demonstrates empirically across five LLM architectures that Integrated prompting consistently outperforms Decoupled prompting, showing the process—not the artifact—matters
  • Provides attention analysis evidence showing distinct attention patterns between Integrated and Decoupled prompting, offering mechanistic insight into why the generative process is the active ingredient

Evaluation Highlights

  • Integrated prompting (create and solve in one prompt) consistently outperforms both Zero-shot and Decoupled prompting across reasoning-intensive tasks and five diverse LLM architectures
  • Decoupled prompting (reusing self-generated examples without the creation context) offers only marginal gains over the Zero-shot baseline, suggesting the examples themselves contribute little independent value

Breakthrough Assessment

6/10. This is a solid, well-controlled mechanistic study that reframes understanding of self-generated prompting, with practical implications for prompt design; however, it is primarily an analytical contribution rather than a new capability or architecture, limiting its transformative impact.

Methodology

  1. Define three prompting conditions: Zero-shot (direct answering), Integrated (LLM generates example problems and solves the target within one unified prompt), and Decoupled (LLM generates examples first, then those examples are injected as separate in-context demonstrations for solving the target)
  2. Evaluate all three conditions on reasoning-intensive benchmarks across five widely used LLM architectures to ensure findings are model-agnostic and generalizable
  3. Conduct attention pattern analysis comparing Integrated vs. Decoupled prompting to identify mechanistic differences in how the model attends to context under each condition, providing interpretability evidence for the behavioral findings
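The three conditions above can be sketched as prompt constructors. This is a minimal illustration of the experimental design only; the template wording, the `n_examples` parameter, and the function names are hypothetical, not the paper's actual prompts.

```python
def zero_shot(target: str) -> str:
    # Condition 1: the model answers the target question directly,
    # with no in-context examples.
    return f"Solve the following problem.\n\nProblem: {target}\nAnswer:"

def integrated(target: str, n_examples: int = 2) -> str:
    # Condition 2: one unified prompt. The model creates example
    # problems and solves the target in the same context, so the
    # generative process is preserved at inference time.
    return (
        f"First, create {n_examples} example problems similar to the one "
        f"below and solve them step by step. Then solve the target problem."
        f"\n\nTarget problem: {target}\nAnswer:"
    )

def decoupled(target: str, generated_examples: list[str]) -> str:
    # Condition 3: examples produced by an earlier generation call are
    # injected as plain in-context demonstrations; the context in which
    # they were created is discarded.
    demos = "\n\n".join(generated_examples)
    return f"{demos}\n\nProblem: {target}\nAnswer:"
```

The paper's finding is that only the second constructor, which keeps creation and solving in one context, yields consistent gains; feeding the third constructor the second one's generated examples adds little.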

System Components

Zero-shot Prompting

Baseline condition where the LLM directly answers the target question without any in-context examples

Integrated Prompting

The LLM creates example problems and solves the target question within a single unified prompt, preserving the generative context throughout

Decoupled Prompting

Self-generated examples are extracted and reused as traditional in-context demonstrations, but the creation context is stripped away before the target question is posed

Attention Analysis

Mechanistic interpretability tool used to compare how attention heads and patterns differ between Integrated and Decoupled conditions, revealing why process matters more than examples
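One way to make this kind of comparison concrete is to measure how much attention mass the tokens of the target question place on the example span under each condition. The sketch below uses a toy NumPy attention matrix; `attention_mass` and the token spans are illustrative assumptions, not the paper's actual metric or implementation.

```python
import numpy as np

def attention_mass(attn: np.ndarray, query_span: range, key_span: range) -> float:
    """Average fraction of attention that query tokens place on key tokens.

    attn: (seq_len, seq_len) attention matrix for one head; rows sum to 1.
    """
    block = attn[np.ix_(list(query_span), list(key_span))]
    return float(block.sum() / len(query_span))

# Toy 6-token sequence: tokens 0-2 play the role of the example span,
# tokens 3-5 the target-question span.
rng = np.random.default_rng(0)
raw = rng.random((6, 6))
attn = raw / raw.sum(axis=1, keepdims=True)  # row-normalize like softmax output

# How strongly do target tokens attend back to the example span?
mass = attention_mass(attn, query_span=range(3, 6), key_span=range(0, 3))
```

In practice one would extract per-head matrices from the model (e.g. via `output_attentions=True` in Hugging Face Transformers) and compare this statistic between Integrated and Decoupled runs.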

Results

| Condition            | vs. Zero-shot           | vs. Decoupled        | Key Finding                                    |
|----------------------|-------------------------|----------------------|------------------------------------------------|
| Integrated Prompting | Significant improvement | Consistently better  | Best overall; process is the active ingredient |
| Decoupled Prompting  | Marginal improvement    | Baseline comparison  | Examples alone add little value                |
| Zero-shot Prompting  | Baseline                | Worse than Decoupled | No self-generation benefit                     |

Key Takeaways

  • When using self-generated few-shot prompting, keep the example creation and target solving in a single unified prompt (Integrated) rather than separating generation and inference into two stages—the creation context is what drives gains
  • Reusing self-generated examples as isolated in-context demonstrations (akin to traditional few-shot ICL) provides minimal benefit; practitioners should not invest in pipelines that extract and recycle generated examples without their originating context
  • Attention analysis is a viable diagnostic tool for understanding why different prompting strategies behave differently, and future prompt design should consider preserving task-relevant generative context rather than treating prompts as static example containers

Abstract

Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.

Generated on 2026-03-02 using Claude