In-Context Reinforcement Learning for Tool Use in Large Language Models
Problem Statement
On complex tasks, LLMs are limited by their internal knowledge, which makes tool augmentation (e.g., Python interpreters, search engines) highly valuable. Existing tool-use training pipelines require expensive SFT cold-starts with large labeled datasets before RL can be applied. This creates a significant data and annotation bottleneck that limits the scalability and accessibility of tool-augmented LLMs.
Key Novelty
- RL-only training framework that bypasses SFT entirely, removing the dependency on large labeled datasets for tool-use initialization
- Curriculum-style in-context example scheduling: starts with few-shot examples in rollout prompts and progressively reduces them to zero-shot, bridging cold-start and autonomous tool use
- Demonstrates that in-context demonstrations during RL rollouts can substitute for SFT supervision, achieving state-of-the-art results on reasoning and tool-use benchmarks
Evaluation Highlights
- ICRL achieves state-of-the-art performance across multiple reasoning and tool-use benchmarks compared to SFT+RL baselines
- The framework is shown to be data-efficient, requiring no labeled SFT data while matching or surpassing pipelines that do
Methodology
- Initialize RL training without any SFT phase; during rollouts, inject few-shot in-context examples demonstrating tool invocation (e.g., calling a Python interpreter or search engine) directly into the prompt
- Apply standard RL optimization (e.g., PPO or GRPO) using task-level reward signals, allowing the model to learn tool-use behavior guided by the in-context demonstrations
- Gradually reduce the number of in-context examples as training progresses according to a curriculum schedule, eventually transitioning to zero-shot prompts so the model internalizes tool-use behavior autonomously
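The annealing step above can be sketched as a simple schedule that maps the current training step to a number of in-context examples and prepends that many demonstrations to the rollout prompt. This is a minimal illustration, not the paper's implementation: the linear decay, the function names, and the plain string concatenation are all assumptions.

```python
import random

def num_examples(step: int, total_steps: int, max_examples: int = 4) -> int:
    """Linearly anneal the number of in-context examples from
    max_examples at step 0 down to zero at the end of training.
    (The linear schedule is an illustrative assumption; the paper's
    curriculum may use a different decay.)"""
    frac = 1.0 - step / total_steps
    return round(max_examples * frac)

def build_rollout_prompt(question: str, demo_pool: list[str],
                         step: int, total_steps: int) -> str:
    """Prepend k randomly sampled tool-use demonstrations to the task
    prompt, where k follows the annealing schedule."""
    k = num_examples(step, total_steps)
    demos = random.sample(demo_pool, k)
    return "\n\n".join(demos + [question])
```

Early in training the prompt carries several worked tool-invocation examples; by the final steps `num_examples` returns 0 and the rollout is zero-shot, so the reward signal alone shapes autonomous tool use.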
System Components
- In-context demonstration injection: injects few-shot tool-use demonstrations into RL rollout prompts to guide the model on how to invoke external tools without requiring SFT pretraining
- Curriculum scheduler: a scheduling mechanism that progressively decreases the number of in-context examples during training, transitioning the model from guided to fully autonomous tool use
- RL training loop: the core reinforcement learning loop (e.g., PPO/GRPO) that trains the model using task-level rewards, leveraging the in-context guidance to bootstrap tool-use policy learning
- Tool interfaces: connections to external tools such as Python interpreters for mathematical reasoning and search engines for factual retrieval, invoked by the model during rollouts
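A tool interface of the kind described above can be sketched as a handler that detects tool-call spans in the model's rollout text, executes them, and splices the output back in so the model can condition on the result. The `<tool>...</tool>` tag format, the function names, and the subprocess-based Python execution are illustrative assumptions; a real system would sandbox execution and support multiple tool types.

```python
import re
import subprocess
import sys

# Hypothetical tag format for model-emitted tool calls.
TOOL_PATTERN = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def run_python_tool(code: str, timeout: float = 5.0) -> str:
    """Execute model-emitted Python in a subprocess and return its
    stdout (or stderr on failure). A production system would use a
    proper sandbox rather than a bare subprocess."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr

def handle_tool_calls(rollout_text: str) -> str:
    """Replace each <tool>...</tool> span with the tool's output so the
    model sees the result when generation continues."""
    return TOOL_PATTERN.sub(lambda m: run_python_tool(m.group(1)),
                            rollout_text)
```

During a rollout, generation would pause at each tool call, the handler would inject the interpreter's output, and decoding would resume from the augmented context.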
Results
| Benchmark | SFT+RL Baseline | ICRL (This Paper) | Delta |
|---|---|---|---|
| Reasoning Benchmarks (avg) | Strong baseline | State-of-the-art | Positive improvement |
| Tool-Use Benchmarks (avg) | SFT-dependent SOTA | Matches/exceeds | Comparable or better |
| Labeled SFT Data Required | Large dataset needed | Zero | 100% reduction in data cost |
Key Takeaways
- Practitioners can train tool-augmented LLMs without collecting or synthesizing labeled SFT demonstrations, significantly reducing data pipeline costs and enabling faster iteration
- The in-context curriculum approach is a reusable design pattern: starting RL with guided demonstrations and annealing to zero-shot can likely generalize to other LLM skill acquisition tasks beyond tool use
- ICRL suggests that the boundary between inference-time in-context learning and training-time RL can be productively blurred, opening avenues for data-efficient LLM training in low-resource settings
Abstract
While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.