
Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking

Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Lizhe Chen, Baolong Bi, Xueqi Cheng
arXiv.org | 2025
In-Context Learning (ICL) via CoT prompting significantly enhances Reasoning LLMs (RLLMs) despite their innate Chain-of-Thought training, primarily by reducing 'overthinking' and controlling token/step distributions. One-shot CoT consistently outperforms few-shot CoT for RLLMs.

Problem Statement

Reasoning LLMs like DeepSeek-R1 are trained with extended thinking and self-correction, raising the question of whether additional CoT prompting is redundant or even harmful. A key limitation of current RLLMs is 'overthinking'—excessive reflection loops that waste compute at inference time without proportional accuracy gains. No prior work had systematically analyzed the interplay between ICL-based CoT prompting and the innate reasoning of RLLMs across model scales and task complexities.

Key Novelty

  • First comprehensive empirical study of Zero-shot and Few-shot CoT prompting effects on RLLMs (1.5B–32B parameters) across mathematical reasoning benchmarks
  • Discovery that CoT prompting reduces excessive reflection tokens by ~90% in some cases, effectively mitigating RLLM overthinking behavior
  • Attention logits analysis revealing that RLLMs overfit to reflection-related words, and that external CoT guidance alleviates this overfitting

Evaluation Highlights

  • CoT prompting yields substantial performance gains on complex mathematical tasks for large-capacity RLLMs, with minimal gains on simple tasks; smaller models show the opposite pattern, gaining most on simple tasks
  • One-shot CoT consistently outperforms few-shot CoT for RLLMs, and reduces excessive reflection steps by approximately 90% in certain settings

Breakthrough Assessment

6/10. This is a solid empirical contribution that challenges assumptions about prompting necessity for RLLMs and provides actionable insights on overthinking reduction, but it is primarily an analysis/prompting study rather than a new architecture or training paradigm.

Methodology

  1. Evaluate RLLMs (1.5B–32B) on mathematical reasoning benchmarks under Zero-shot, Zero-shot CoT, One-shot CoT, and Few-shot CoT conditions to measure accuracy and token usage
  2. Analyze the distribution of thinking tokens and reasoning steps across prompting conditions to quantify overthinking reduction and identify reflection-related inefficiencies
  3. Perform attention logits analysis to diagnose RLLM overfitting to reflection-related tokens and demonstrate how CoT prompting redistributes attention to task-relevant content
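The four prompting conditions in step 1 can be sketched as simple prompt builders. The demonstration text and exact phrasing below are illustrative assumptions, not the paper's actual prompts; only the Zero-shot CoT trigger "Let's think step by step" is standard:

```python
# Hypothetical one-shot demonstration; the paper's demonstrations are not reproduced here.
DEMO = (
    "Q: A train travels 60 miles in 1.5 hours. What is its average speed?\n"
    "A: Speed = distance / time = 60 / 1.5 = 40 mph. The answer is 40.\n"
)

def build_prompt(question, condition, demos=None):
    """Assemble a prompt for one of the four evaluation conditions."""
    if condition == "zero-shot":
        return f"Q: {question}\nA:"
    if condition == "zero-shot-cot":
        # Standard Zero-shot CoT trigger appended to the answer slot.
        return f"Q: {question}\nA: Let's think step by step."
    if condition == "one-shot-cot":
        # A single worked reasoning chain precedes the target question.
        return f"{DEMO}\nQ: {question}\nA:"
    if condition == "few-shot-cot":
        # Multiple worked chains in context.
        return "\n".join(demos or [DEMO]) + f"\nQ: {question}\nA:"
    raise ValueError(f"unknown condition: {condition}")
```

Accuracy and token usage would then be measured by running each condition's prompt through the RLLM and scoring the completions.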

System Components

Zero-shot CoT Prompting

Appending 'Let's think step by step' to the prompt without examples, tested as a lightweight intervention on RLLMs

Few-shot CoT Prompting

Providing multiple input-output reasoning chain examples in context; found to underperform one-shot CoT for RLLMs

One-shot CoT Prompting

Single demonstration of a reasoning chain; identified as the optimal ICL strategy for RLLMs, balancing guidance and flexibility

Thinking Token Distribution Analysis

Quantitative measurement of how prompting strategies shift the number of reasoning steps and reflection tokens generated by RLLMs
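A minimal sketch of how such a measurement might be implemented, counting reflection markers and reasoning steps in a generated trace. The marker list is an assumption for illustration; the paper does not publish its exact keyword set:

```python
# Hypothetical reflection-related phrases; the paper's actual list is not given.
REFLECTION_MARKERS = ("wait", "hmm", "alternatively", "on second thought")

def count_reflections(trace):
    """Count occurrences of reflection-related phrases in a reasoning trace."""
    lowered = trace.lower()
    return sum(lowered.count(marker) for marker in REFLECTION_MARKERS)

def count_steps(trace):
    """Approximate the number of reasoning steps as non-empty lines."""
    return sum(1 for line in trace.splitlines() if line.strip())
```

Comparing these counts across prompting conditions would yield the token/step distributions the paper analyzes.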

Attention Logits Analysis

Inspection of attention weights to detect RLLM overfitting to reflection-related keywords and measure mitigation via CoT guidance
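One plausible way to quantify this, sketched below under stated assumptions: given a model's attention weights over a tokenized sequence, compute the fraction of attention mass landing on reflection-word positions. The keyword set and word-level tokenization are illustrative simplifications, not the paper's method:

```python
import numpy as np

# Illustrative reflection keywords; real analyses would use the model's tokenizer.
REFLECTION_WORDS = {"wait", "hmm", "alternatively"}

def reflection_attention_share(attn, tokens):
    """attn: array of shape (heads, seq, seq) with rows summing to 1.
    Returns the mean fraction of attention each query position places
    on reflection-word positions."""
    idx = [i for i, t in enumerate(tokens) if t.lower() in REFLECTION_WORDS]
    if not idx:
        return 0.0
    # Sum attention to reflection positions, then average over heads and queries.
    return float(attn[:, :, idx].sum(axis=-1).mean())
```

A higher share under Zero-shot than under One-shot CoT would be consistent with the paper's finding that external CoT guidance mitigates overfitting to reflection-related words.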

Results

| Condition | Baseline (No CoT) | With One-shot CoT | Delta |
|---|---|---|---|
| Large model on complex math tasks | Lower accuracy | Substantially higher accuracy | Significant gain |
| Small model on simple math tasks | Moderate accuracy | Improved accuracy | Moderate gain |
| Excessive reflection steps | High (100% baseline) | Reduced by ~90% | -90% overthinking |
| Few-shot CoT vs One-shot CoT | Few-shot CoT baseline | One-shot CoT superior | Consistent improvement |

Key Takeaways

  • Do not skip CoT prompting for RLLMs—even models trained with extended thinking benefit from explicit CoT guidance, especially on complex tasks
  • Prefer one-shot CoT over few-shot CoT when prompting RLLMs, as more examples do not help and can hurt performance
  • CoT prompting is a practical, low-cost tool to reduce RLLM inference costs by cutting excessive reflection loops by up to 90%, improving both efficiency and accuracy

Abstract

Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs), which employ extended thinking processes with reflection and self-correction capabilities, demonstrating the effectiveness of test-time scaling. RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting, a popular In-Context Learning (ICL) method for chat LLMs, necessary to enhance the reasoning capability of RLLMs?" In this work, we present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks. We examine models ranging from 1.5B to 32B parameters, finding that contrary to concerns, CoT prompting significantly enhances RLLMs' performance in most scenarios. Our results reveal distinct patterns: large-capacity models show minimal improvement on simple tasks but substantial gains on complex problems, while smaller models exhibit the opposite behavior. Further analysis demonstrates that CoT prompting effectively controls the distribution of the numbers of thinking tokens and reasoning steps, reducing excessive reflections by approximately 90% in some cases. Moreover, attention logits analysis reveals the RLLMs' overfitting to reflection-related words, which is mitigated by external CoT guidance. Notably, our experiments indicate that for RLLMs, one-shot CoT consistently yields superior performance compared to Few-shot CoT approaches. Our findings provide important insights for optimizing RLLMs' performance through appropriate prompting strategies.

Generated on 2026-03-02 using Claude