In-Context Reinforcement Learning for Tool Use in Large Language Models
Problem Statement
On complex tasks, LLMs are limited by their internal knowledge, which makes tool augmentation (e.g., Python interpreters, search engines) highly valuable. Existing tool-use training pipelines require expensive SFT cold-starts with large labeled datasets before RL can be applied. This creates a significant data and annotation bottleneck that limits the scalability and accessibility of tool-augmented LLMs.
Key Novelty
- RL-only training framework that bypasses SFT entirely, removing the dependency on large labeled datasets for tool-use initialization
- Curriculum-style in-context example scheduling: starts with few-shot examples in rollout prompts and progressively reduces them to zero-shot, bridging cold-start and autonomous tool use
- Demonstrates that in-context demonstrations during RL rollouts can substitute for SFT supervision, achieving state-of-the-art results on reasoning and tool-use benchmarks
Evaluation Highlights
- ICRL achieves state-of-the-art performance across multiple reasoning and tool-use benchmarks compared to SFT+RL baselines
- The framework is shown to be data-efficient, requiring no labeled SFT data while matching or surpassing pipelines that do
Methodology
- Initialize RL training without any SFT phase; during rollouts, inject few-shot in-context examples demonstrating tool invocation (e.g., calling a Python interpreter or search engine) directly into the prompt
- Apply standard RL optimization (e.g., PPO or GRPO) using task-level reward signals, allowing the model to learn tool-use behavior guided by the in-context demonstrations
- Gradually reduce the number of in-context examples as training progresses according to a curriculum schedule, eventually transitioning to zero-shot prompts so the model internalizes tool-use behavior autonomously
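The annealing step above can be sketched as a simple schedule that maps the current training step to a number of in-context examples and prepends that many demonstrations to the rollout prompt. This is a minimal illustration, not the paper's implementation: the linear decay, the function names, and the plain string concatenation are all assumptions.

```python
import random

def num_examples(step: int, total_steps: int, max_examples: int = 4) -> int:
    """Linearly anneal the number of in-context examples from
    max_examples at step 0 down to zero at the end of training.
    (The linear schedule is an illustrative assumption; the paper's
    curriculum may use a different decay.)"""
    frac = 1.0 - step / total_steps
    return round(max_examples * frac)

def build_rollout_prompt(question: str, demo_pool: list[str],
                         step: int, total_steps: int) -> str:
    """Prepend k randomly sampled tool-use demonstrations to the task
    prompt, where k follows the annealing schedule."""
    k = num_examples(step, total_steps)
    demos = random.sample(demo_pool, k)
    return "\n\n".join(demos + [question])
```

Early in training the prompt carries several worked tool-invocation examples; by the final steps `num_examples` returns 0 and the rollout is zero-shot, so the reward signal alone shapes autonomous tool use.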
System Components
- In-context demonstration injection: injects few-shot tool-use demonstrations into RL rollout prompts to guide the model on how to invoke external tools without requiring SFT pretraining
- Curriculum scheduler: a scheduling mechanism that progressively decreases the number of in-context examples during training, transitioning the model from guided to fully autonomous tool use
- RL training loop: the core reinforcement learning loop (e.g., PPO/GRPO) that trains the model using task-level rewards, leveraging the in-context guidance to bootstrap tool-use policy learning
- Tool interfaces: connections to external tools such as Python interpreters for mathematical reasoning and search engines for factual retrieval, invoked by the model during rollouts
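A tool interface of the kind described above can be sketched as a handler that detects tool-call spans in the model's rollout text, executes them, and splices the output back in so the model can condition on the result. The `<tool>...</tool>` tag format, the function names, and the subprocess-based Python execution are illustrative assumptions; a real system would sandbox execution and support multiple tool types.

```python
import re
import subprocess
import sys

# Hypothetical tag format for model-emitted tool calls.
TOOL_PATTERN = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def run_python_tool(code: str, timeout: float = 5.0) -> str:
    """Execute model-emitted Python in a subprocess and return its
    stdout (or stderr on failure). A production system would use a
    proper sandbox rather than a bare subprocess."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr

def handle_tool_calls(rollout_text: str) -> str:
    """Replace each <tool>...</tool> span with the tool's output so the
    model sees the result when generation continues."""
    return TOOL_PATTERN.sub(lambda m: run_python_tool(m.group(1)),
                            rollout_text)
```

During a rollout, generation would pause at each tool call, the handler would inject the interpreter's output, and decoding would resume from the augmented context.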
Results
| Benchmark | SFT+RL Baseline | ICRL (This Paper) | Delta |
|---|---|---|---|
| Reasoning Benchmarks (avg) | Strong baseline | State-of-the-art | Positive improvement |
| Tool-Use Benchmarks (avg) | SFT-dependent SOTA | Matches/exceeds | Comparable or better |
| Labeled SFT Data Required | Large dataset needed | Zero | 100% reduction in data cost |
Key Takeaways
- Practitioners can train tool-augmented LLMs without collecting or synthesizing labeled SFT demonstrations, significantly reducing data pipeline costs and enabling faster iteration
- The in-context curriculum approach is a reusable design pattern: starting RL with guided demonstrations and annealing to zero-shot can likely generalize to other LLM skill acquisition tasks beyond tool use
- ICRL suggests that the boundary between inference-time in-context learning and training-time RL can be productively blurred, opening avenues for data-efficient LLM training in low-resource settings
Abstract
While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.