
Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking

Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Lizhe Chen, Baolong Bi, Xueqi Cheng
arXiv.org | 2025
In-Context Learning (ICL) via CoT prompting significantly enhances Reasoning LLMs (RLLMs) despite their innate Chain-of-Thought training, primarily by reducing 'overthinking' and controlling token/step distributions. One-shot CoT consistently outperforms few-shot CoT for RLLMs.

Problem Statement

Reasoning LLMs like DeepSeek-R1 are trained with extended thinking and self-correction, raising the question of whether additional CoT prompting is redundant or even harmful. A key limitation of current RLLMs is 'overthinking'—excessive reflection loops that waste compute at inference time without proportional accuracy gains. No prior work had systematically analyzed the interplay between ICL-based CoT prompting and the innate reasoning of RLLMs across model scales and task complexities.

Key Novelty

  • First comprehensive empirical study of Zero-shot and Few-shot CoT prompting effects on RLLMs (1.5B–32B parameters) across mathematical reasoning benchmarks
  • Discovery that CoT prompting reduces excessive reflection tokens by ~90% in some cases, effectively mitigating RLLM overthinking behavior
  • Attention logits analysis revealing that RLLMs overfit to reflection-related words, and that external CoT guidance alleviates this overfitting

Evaluation Highlights

  • CoT prompting yields substantial performance gains on complex mathematical tasks for large-capacity RLLMs, with minimal gains on simple tasks; smaller models show the opposite pattern, gaining most on simple tasks
  • One-shot CoT consistently outperforms few-shot CoT for RLLMs, and reduces excessive reflection steps by approximately 90% in certain settings

Breakthrough Assessment

6/10. This is a solid empirical contribution that challenges assumptions about prompting necessity for RLLMs and provides actionable insights on overthinking reduction, but it is primarily an analysis/prompting study rather than a new architecture or training paradigm.

Methodology

  1. Evaluate RLLMs (1.5B–32B) on mathematical reasoning benchmarks under Zero-shot, Zero-shot CoT, One-shot CoT, and Few-shot CoT conditions to measure accuracy and token usage
  2. Analyze the distribution of thinking tokens and reasoning steps across prompting conditions to quantify overthinking reduction and identify reflection-related inefficiencies
  3. Perform attention logits analysis to diagnose RLLM overfitting to reflection-related tokens and demonstrate how CoT prompting redistributes attention to task-relevant content
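The four prompting conditions in step 1 can be sketched as simple prompt builders. The demonstration text and exact phrasing below are illustrative assumptions, not the paper's actual prompts; only the Zero-shot CoT trigger "Let's think step by step" is standard:

```python
# Hypothetical one-shot demonstration; the paper's demonstrations are not reproduced here.
DEMO = (
    "Q: A train travels 60 miles in 1.5 hours. What is its average speed?\n"
    "A: Speed = distance / time = 60 / 1.5 = 40 mph. The answer is 40.\n"
)

def build_prompt(question, condition, demos=None):
    """Assemble a prompt for one of the four evaluation conditions."""
    if condition == "zero-shot":
        return f"Q: {question}\nA:"
    if condition == "zero-shot-cot":
        # Standard Zero-shot CoT trigger appended to the answer slot.
        return f"Q: {question}\nA: Let's think step by step."
    if condition == "one-shot-cot":
        # A single worked reasoning chain precedes the target question.
        return f"{DEMO}\nQ: {question}\nA:"
    if condition == "few-shot-cot":
        # Multiple worked chains in context.
        return "\n".join(demos or [DEMO]) + f"\nQ: {question}\nA:"
    raise ValueError(f"unknown condition: {condition}")
```

Accuracy and token usage would then be measured by running each condition's prompt through the RLLM and scoring the completions.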

System Components

Zero-shot CoT Prompting

Appending 'Let's think step by step' to the prompt without examples, tested as a lightweight intervention on RLLMs

Few-shot CoT Prompting

Providing multiple input-output reasoning chain examples in context; found to underperform one-shot CoT for RLLMs

One-shot CoT Prompting

Single demonstration of a reasoning chain; identified as the optimal ICL strategy for RLLMs, balancing guidance and flexibility

Thinking Token Distribution Analysis

Quantitative measurement of how prompting strategies shift the number of reasoning steps and reflection tokens generated by RLLMs
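A minimal sketch of how such a measurement might be implemented, counting reflection markers and reasoning steps in a generated trace. The marker list is an assumption for illustration; the paper does not publish its exact keyword set:

```python
# Hypothetical reflection-related phrases; the paper's actual list is not given.
REFLECTION_MARKERS = ("wait", "hmm", "alternatively", "on second thought")

def count_reflections(trace):
    """Count occurrences of reflection-related phrases in a reasoning trace."""
    lowered = trace.lower()
    return sum(lowered.count(marker) for marker in REFLECTION_MARKERS)

def count_steps(trace):
    """Approximate the number of reasoning steps as non-empty lines."""
    return sum(1 for line in trace.splitlines() if line.strip())
```

Comparing these counts across prompting conditions would yield the token/step distributions the paper analyzes.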

Attention Logits Analysis

Inspection of attention weights to detect RLLM overfitting to reflection-related keywords and measure mitigation via CoT guidance
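One plausible way to quantify this, sketched below under stated assumptions: given a model's attention weights over a tokenized sequence, compute the fraction of attention mass landing on reflection-word positions. The keyword set and word-level tokenization are illustrative simplifications, not the paper's method:

```python
import numpy as np

# Illustrative reflection keywords; real analyses would use the model's tokenizer.
REFLECTION_WORDS = {"wait", "hmm", "alternatively"}

def reflection_attention_share(attn, tokens):
    """attn: array of shape (heads, seq, seq) with rows summing to 1.
    Returns the mean fraction of attention each query position places
    on reflection-word positions."""
    idx = [i for i, t in enumerate(tokens) if t.lower() in REFLECTION_WORDS]
    if not idx:
        return 0.0
    # Sum attention to reflection positions, then average over heads and queries.
    return float(attn[:, :, idx].sum(axis=-1).mean())
```

A higher share under Zero-shot than under One-shot CoT would be consistent with the paper's finding that external CoT guidance mitigates overfitting to reflection-related words.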

Results

| Condition | Baseline (No CoT) | With One-shot CoT | Delta |
|---|---|---|---|
| Large model on complex math tasks | Lower accuracy | Substantially higher accuracy | Significant gain |
| Small model on simple math tasks | Moderate accuracy | Improved accuracy | Moderate gain |
| Excessive reflection steps | High (100% baseline) | Reduced by ~90% | -90% overthinking |
| Few-shot CoT vs One-shot CoT | Few-shot CoT baseline | One-shot CoT superior | Consistent improvement |

Key Takeaways

  • Do not skip CoT prompting for RLLMs—even models trained with extended thinking benefit from explicit CoT guidance, especially on complex tasks
  • Prefer one-shot CoT over few-shot CoT when prompting RLLMs, as more examples do not help and can hurt performance
  • CoT prompting is a practical, low-cost tool to reduce RLLM inference costs by cutting excessive reflection loops by up to 90%, improving both efficiency and accuracy

Abstract

Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs), which employ extended thinking processes with reflection and self-correction capabilities, demonstrating the effectiveness of test-time scaling. RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting, a popular In-Context Learning (ICL) method for chat LLMs, necessary to enhance the reasoning capability of RLLMs?" In this work, we present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks. We examine models ranging from 1.5B to 32B parameters, finding that contrary to concerns, CoT prompting significantly enhances RLLMs' performance in most scenarios. Our results reveal distinct patterns: large-capacity models show minimal improvement on simple tasks but substantial gains on complex problems, while smaller models exhibit the opposite behavior. Further analysis demonstrates that CoT prompting effectively controls the distribution of the numbers of thinking tokens and reasoning steps, reducing excessive reflections by approximately 90% in some cases. Moreover, attention logits analysis reveals the RLLMs' overfitting to reflection-related words, which is mitigated by external CoT guidance. Notably, our experiments indicate that for RLLMs, one-shot CoT consistently yields superior performance compared to Few-shot CoT approaches. Our findings provide important insights for optimizing RLLMs' performance through appropriate prompting strategies.

Generated on 2026-03-02 using Claude