Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

Yuan-Jay Lu, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu
2026
SYNTHAGENT is a framework that trains small language models to become capable agents by jointly synthesizing diverse tool-use tasks and simulating complete interaction environments with rubric-based RL rewards. This enables small models to match or outperform larger models on agentic benchmarks without relying on real-world APIs.

Problem Statement

Small LLMs significantly underperform large models on agentic tasks that require tool use, multi-step reasoning, and user interaction. Existing open-source agentic training data is narrow in task variety and too easily solved to provide a strong learning signal. Real-world APIs are unstable and lack diversity, making large-scale reinforcement learning rollouts impractical.

Key Novelty

  • Joint synthesis of diverse tool-use tasks and complete mock environments using a strong teacher model, including intentionally underspecified instructions that force agents to query users for missing details
  • An LLM-based user simulator paired with a mock tool system that provides stable, scalable RL rollout environments without dependence on real-world APIs
  • Task-level rubric-based reward construction grounded in subgoal completion, user-agent interaction quality, and forbidden behavior penalties, enabling fine-grained RL signal for agentic training

Evaluation Highlights

  • Models trained with SYNTHAGENT achieve substantial performance gains across 14 challenging datasets spanning math, search, and tool-use domains
  • Small models trained with SYNTHAGENT outperform larger baseline models, demonstrating that synthetic agentic training can overcome parameter-count disadvantages

Breakthrough Assessment

7/10. SYNTHAGENT simultaneously addresses two well-recognized structural bottlenecks in agentic LLM training, data diversity and environment stability, and demonstrates small models beating larger ones, a meaningful practical advance for resource-constrained deployments. However, the core ideas (synthetic data generation, rubric-based RL rewards, simulated environments) are extensions of existing paradigms rather than entirely new concepts.

Methodology

  1. A strong teacher model generates novel agentic tasks with associated tool ecosystems, then rewrites task instructions to be intentionally underspecified, creating ambiguity that requires agent-initiated clarification queries
  2. During RL rollout, an LLM-based user simulator responds to agent clarification requests with user-private information, while a mock tool system executes tool calls and returns stable, deterministic responses (see the rollout sketch after this list)
  3. Task-level rubrics are constructed per task to evaluate subgoal completion, quality of user-agent interaction, and avoidance of forbidden behaviors, providing structured reward signals for policy optimization
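
To make the rollout protocol concrete, here is a minimal, self-contained sketch of a single rollout, assuming simple callable interfaces for the policy, user simulator, mock tool handlers, and rubric scorer. All names, message roles, and signatures are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List

# Illustrative message format: {"role": ..., "content": str}, where role is one of
# "user", "tool", "agent_ask_user", "agent_tool_call", or "agent_final".
Message = Dict[str, str]


def run_rollout(
    policy: Callable[[List[Message]], Message],      # agent model (assumed interface)
    user_sim: Callable[[List[Message]], str],        # LLM-based user simulator (assumed)
    mock_tools: Dict[str, Callable[[str], str]],     # tool name -> deterministic handler
    score_rubric: Callable[[List[Message]], float],  # task-level rubric scorer (assumed)
    task_prompt: str,
    max_turns: int = 16,
) -> float:
    """Roll out one synthetic task and return its rubric-based reward."""
    history: List[Message] = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        action = policy(history)
        history.append(action)
        if action["role"] == "agent_final":           # agent submits its answer
            break
        if action["role"] == "agent_ask_user":        # clarification query -> user simulator
            history.append({"role": "user", "content": user_sim(history)})
        elif action["role"] == "agent_tool_call":     # tool call -> mock tool system
            name, _, arg = action["content"].partition(":")
            handler = mock_tools.get(name.strip(), lambda a: "error: unknown tool")
            history.append({"role": "tool", "content": handler(arg.strip())})
    return score_rubric(history)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any real model or API.
    reward = run_rollout(
        policy=lambda h: {"role": "agent_final", "content": "done"},
        user_sim=lambda h: "the missing detail is X",
        mock_tools={"search": lambda q: f"mock results for {q}"},
        score_rubric=lambda h: 1.0 if h[-1]["content"] == "done" else 0.0,
        task_prompt="Book a table (details intentionally underspecified).",
    )
    print(reward)
```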

System Components

Teacher Model Task Synthesizer

A strong LLM that generates diverse novel tasks, custom tool ecosystems, and intentionally underspecified instructions to expand training data variety and difficulty
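
As a rough illustration, a synthesized task record might bundle the tool ecosystem, the deliberately underspecified instruction, the user-private details withheld from the agent, and the rubric items. The schema below is a hypothetical sketch, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SyntheticTask:
    """Hypothetical record emitted by the teacher-model task synthesizer."""
    instruction: str                    # intentionally underspecified task text
    tools: Dict[str, str]               # tool name -> short spec for the mock system
    user_private_info: Dict[str, str]   # details only the user simulator may reveal
    subgoals: List[str] = field(default_factory=list)    # rubric: required steps
    forbidden: List[str] = field(default_factory=list)   # rubric: penalized behaviors


example = SyntheticTask(
    instruction="Book me a flight next week.",            # date, origin, budget omitted
    tools={"flight_search": "query flights by date and route",
           "book_flight": "book a flight by id"},
    user_private_info={"date": "2026-03-04", "origin": "SFO", "budget": "$400"},
    subgoals=["ask for the missing date/origin", "call flight_search", "call book_flight"],
    forbidden=["booking without confirming the budget"],
)
```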

LLM-Based User Simulator

Simulates a human user during RL rollouts by holding user-private information and responding to agent clarification queries, enabling realistic multi-turn interactions without human annotation
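
One way to sketch such a simulator: an LLM is prompted with the user-private information and the agent's clarification question, and reveals only what is asked. The class below uses an injected `llm` callable (any text-in, text-out function) plus a trivial keyword fallback so it runs standalone; it is an assumed interface, not the paper's implementation.

```python
from typing import Callable, Dict, Optional


class UserSimulator:
    """Toy LLM-based user simulator that holds user-private information."""

    def __init__(self, private_info: Dict[str, str],
                 llm: Optional[Callable[[str], str]] = None):
        self.private_info = private_info
        self.llm = llm  # e.g. a chat-completion wrapper; optional in this sketch

    def respond(self, clarification_question: str) -> str:
        if self.llm is not None:
            prompt = (
                "You are simulating a user. Private info (reveal only what is asked): "
                f"{self.private_info}\nAgent asks: {clarification_question}\nUser reply:"
            )
            return self.llm(prompt)
        # Keyword fallback so the sketch runs without any model access.
        revealed = {k: v for k, v in self.private_info.items()
                    if k.lower() in clarification_question.lower()}
        return f"Sure: {revealed}" if revealed else "Whatever you think is best."


sim = UserSimulator({"date": "2026-03-04", "budget": "$400"})
print(sim.respond("What date do you want to travel, and what is your budget?"))
```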

Mock Tool System

A stable, simulated tool execution environment that processes agent tool calls and returns consistent responses, replacing unstable real-world APIs for scalable RL training
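
A mock tool system can be as simple as a registry of deterministic handlers backed by fixed fixture data, so identical calls always return identical responses. The design below is an assumed sketch, not the paper's code.

```python
from typing import Any, Callable, Dict


class MockToolSystem:
    """Registry of deterministic tool handlers backed by fixture data."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self._handlers[name] = handler

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._handlers:
            return {"error": f"unknown tool: {name}"}   # stable error, never a flaky API
        return self._handlers[name](**kwargs)


# Fixture-backed tools: the same arguments always yield the same response.
FLIGHTS = {("SFO", "2026-03-04"): [{"id": "F1", "price": 380}, {"id": "F2", "price": 520}]}

tools = MockToolSystem()
tools.register("flight_search", lambda origin, date: FLIGHTS.get((origin, date), []))
tools.register("book_flight", lambda flight_id: {"status": "confirmed", "flight_id": flight_id})

print(tools.call("flight_search", origin="SFO", date="2026-03-04"))
print(tools.call("book_flight", flight_id="F1"))
```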

Rubric-Based Reward Module

Constructs per-task evaluation rubrics based on required subgoals, interaction quality metrics, and forbidden behavior constraints to produce dense, interpretable RL reward signals
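
A rubric scorer of this kind might be expressed as weighted checks over the rollout trajectory: credit per completed subgoal, a bonus for good user interaction (e.g. asking before acting), and penalties for forbidden behaviors. The weights, check functions, and trajectory format below are illustrative assumptions rather than the paper's exact formulation.

```python
from typing import Callable, Dict, List

Trajectory = List[Dict[str, str]]      # assumed: list of {"role", "content"} events
Check = Callable[[Trajectory], bool]   # each rubric item is a predicate on the trajectory


def rubric_reward(traj: Trajectory,
                  subgoals: Dict[str, Check],
                  interaction: Dict[str, Check],
                  forbidden: Dict[str, Check],
                  w_sub: float = 1.0, w_int: float = 0.5, w_forb: float = 1.0) -> float:
    """Dense reward: subgoal credit + interaction-quality bonus - forbidden-behavior penalty."""
    sub = sum(w_sub * check(traj) for check in subgoals.values()) / max(len(subgoals), 1)
    inter = sum(w_int * check(traj) for check in interaction.values()) / max(len(interaction), 1)
    penalty = sum(w_forb * check(traj) for check in forbidden.values())
    return sub + inter - penalty


# Toy usage: did the agent ask for the date before booking, and avoid double booking?
traj = [{"role": "agent_ask_user", "content": "Which date?"},
        {"role": "user", "content": "2026-03-04"},
        {"role": "agent_tool_call", "content": "book_flight: F1"}]

reward = rubric_reward(
    traj,
    subgoals={"booked": lambda t: any("book_flight" in e["content"] for e in t)},
    interaction={"asked_first": lambda t: t[0]["role"] == "agent_ask_user"},
    forbidden={"double_booking": lambda t: sum("book_flight" in e["content"] for e in t) > 1},
)
print(reward)  # 1.0 + 0.5 - 0.0 = 1.5
```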

Results

| Benchmark Category | Larger Baseline Models | SYNTHAGENT Small Model | Delta |
|---|---|---|---|
| Math reasoning tasks | Higher-parameter-count baseline | Outperforms | Positive gain |
| Search/retrieval tasks | Higher-parameter-count baseline | Outperforms | Positive gain |
| Tool-use tasks | Higher-parameter-count baseline | Outperforms | Positive gain |
| Aggregate (14 datasets) | Large model baselines | Substantial gains across all | Small > large baselines |

Key Takeaways

  • Practitioners can train capable agentic small LLMs without access to real-world APIs by using fully synthetic mock environments—reducing infrastructure costs and enabling stable large-scale RL training
  • Deliberately underspecifying task instructions during training is an effective technique to teach agents proactive clarification-seeking behavior, which is critical for real-world deployment
  • Rubric-based reward construction tied to subgoals and interaction quality offers a practical template for designing dense, interpretable reward signals in agentic RL settings beyond simple outcome-based rewards

Abstract

Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.
