Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

Yuan-Jay Lu, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu
2026
SYNTHAGENT is a framework that trains small language models to become capable agents by jointly synthesizing diverse tool-use tasks and simulating complete interaction environments with rubric-based RL rewards. This enables small models to match or outperform larger models on agentic benchmarks without relying on real-world APIs.

Problem Statement

Small LLMs significantly underperform large models on agentic tasks that require tool use, multi-step reasoning, and user interaction. Existing open-source agentic training data is narrow in task variety and too easily solved to provide a strong learning signal. Real-world APIs are unstable and lack diversity, making large-scale reinforcement learning rollouts impractical.

Key Novelty

  • Joint synthesis of diverse tool-use tasks and complete mock environments using a strong teacher model, including intentionally underspecified instructions that force agents to query users for missing details
  • An LLM-based user simulator paired with a mock tool system that provides stable, scalable RL rollout environments without dependence on real-world APIs
  • Task-level rubric-based reward construction grounded in subgoal completion, user-agent interaction quality, and forbidden behavior penalties, enabling fine-grained RL signal for agentic training

Evaluation Highlights

  • Models trained with SYNTHAGENT achieve substantial performance gains across 14 challenging datasets spanning math, search, and tool-use domains
  • Small models trained with SYNTHAGENT outperform larger baseline models, demonstrating that synthetic agentic training can overcome parameter-count disadvantages

Breakthrough Assessment

7/10. SYNTHAGENT simultaneously addresses two well-recognized structural bottlenecks in agentic LLM training, data diversity and environment stability, and demonstrates small models beating larger ones, a meaningful practical advance for resource-constrained deployments. However, the core ideas (synthetic data generation, rubric-based RL rewards, simulated environments) are extensions of existing paradigms rather than entirely new concepts.

Methodology

  1. A strong teacher model generates novel agentic tasks with associated tool ecosystems, then rewrites task instructions to be intentionally underspecified, creating ambiguity that requires agent-initiated clarification queries
  2. During RL rollout, an LLM-based user simulator responds to agent clarification requests with user-private information, while a mock tool system executes tool calls and returns stable, deterministic responses (see the rollout sketch after this list)
  3. Task-level rubrics are constructed per task to evaluate subgoal completion, quality of user-agent interaction, and avoidance of forbidden behaviors, providing structured reward signals for policy optimization
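
To make the rollout protocol concrete, here is a minimal, self-contained sketch of a single rollout, assuming simple callable interfaces for the policy, user simulator, mock tool handlers, and rubric scorer. All names, message roles, and signatures are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List

# Illustrative message format: {"role": ..., "content": str}, where role is one of
# "user", "tool", "agent_ask_user", "agent_tool_call", or "agent_final".
Message = Dict[str, str]


def run_rollout(
    policy: Callable[[List[Message]], Message],      # agent model (assumed interface)
    user_sim: Callable[[List[Message]], str],        # LLM-based user simulator (assumed)
    mock_tools: Dict[str, Callable[[str], str]],     # tool name -> deterministic handler
    score_rubric: Callable[[List[Message]], float],  # task-level rubric scorer (assumed)
    task_prompt: str,
    max_turns: int = 16,
) -> float:
    """Roll out one synthetic task and return its rubric-based reward."""
    history: List[Message] = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        action = policy(history)
        history.append(action)
        if action["role"] == "agent_final":           # agent submits its answer
            break
        if action["role"] == "agent_ask_user":        # clarification query -> user simulator
            history.append({"role": "user", "content": user_sim(history)})
        elif action["role"] == "agent_tool_call":     # tool call -> mock tool system
            name, _, arg = action["content"].partition(":")
            handler = mock_tools.get(name.strip(), lambda a: "error: unknown tool")
            history.append({"role": "tool", "content": handler(arg.strip())})
    return score_rubric(history)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any real model or API.
    reward = run_rollout(
        policy=lambda h: {"role": "agent_final", "content": "done"},
        user_sim=lambda h: "the missing detail is X",
        mock_tools={"search": lambda q: f"mock results for {q}"},
        score_rubric=lambda h: 1.0 if h[-1]["content"] == "done" else 0.0,
        task_prompt="Book a table (details intentionally underspecified).",
    )
    print(reward)
```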

System Components

Teacher Model Task Synthesizer

A strong LLM that generates diverse novel tasks, custom tool ecosystems, and intentionally underspecified instructions to expand training data variety and difficulty
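
As a rough illustration, a synthesized task record might bundle the tool ecosystem, the deliberately underspecified instruction, the user-private details withheld from the agent, and the rubric items. The schema below is a hypothetical sketch, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SyntheticTask:
    """Hypothetical record emitted by the teacher-model task synthesizer."""
    instruction: str                    # intentionally underspecified task text
    tools: Dict[str, str]               # tool name -> short spec for the mock system
    user_private_info: Dict[str, str]   # details only the user simulator may reveal
    subgoals: List[str] = field(default_factory=list)    # rubric: required steps
    forbidden: List[str] = field(default_factory=list)   # rubric: penalized behaviors


example = SyntheticTask(
    instruction="Book me a flight next week.",            # date, origin, budget omitted
    tools={"flight_search": "query flights by date and route",
           "book_flight": "book a flight by id"},
    user_private_info={"date": "2026-03-04", "origin": "SFO", "budget": "$400"},
    subgoals=["ask for the missing date/origin", "call flight_search", "call book_flight"],
    forbidden=["booking without confirming the budget"],
)
```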

LLM-Based User Simulator

Simulates a human user during RL rollouts by holding user-private information and responding to agent clarification queries, enabling realistic multi-turn interactions without human annotation
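
One way to sketch such a simulator: an LLM is prompted with the user-private information and the agent's clarification question, and reveals only what is asked. The class below uses an injected `llm` callable (any text-in, text-out function) plus a trivial keyword fallback so it runs standalone; it is an assumed interface, not the paper's implementation.

```python
from typing import Callable, Dict, Optional


class UserSimulator:
    """Toy LLM-based user simulator that holds user-private information."""

    def __init__(self, private_info: Dict[str, str],
                 llm: Optional[Callable[[str], str]] = None):
        self.private_info = private_info
        self.llm = llm  # e.g. a chat-completion wrapper; optional in this sketch

    def respond(self, clarification_question: str) -> str:
        if self.llm is not None:
            prompt = (
                "You are simulating a user. Private info (reveal only what is asked): "
                f"{self.private_info}\nAgent asks: {clarification_question}\nUser reply:"
            )
            return self.llm(prompt)
        # Keyword fallback so the sketch runs without any model access.
        revealed = {k: v for k, v in self.private_info.items()
                    if k.lower() in clarification_question.lower()}
        return f"Sure: {revealed}" if revealed else "Whatever you think is best."


sim = UserSimulator({"date": "2026-03-04", "budget": "$400"})
print(sim.respond("What date do you want to travel, and what is your budget?"))
```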

Mock Tool System

A stable, simulated tool execution environment that processes agent tool calls and returns consistent responses, replacing unstable real-world APIs for scalable RL training
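
A mock tool system can be as simple as a registry of deterministic handlers backed by fixed fixture data, so identical calls always return identical responses. The design below is an assumed sketch, not the paper's code.

```python
from typing import Any, Callable, Dict


class MockToolSystem:
    """Registry of deterministic tool handlers backed by fixture data."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self._handlers[name] = handler

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._handlers:
            return {"error": f"unknown tool: {name}"}   # stable error, never a flaky API
        return self._handlers[name](**kwargs)


# Fixture-backed tools: the same arguments always yield the same response.
FLIGHTS = {("SFO", "2026-03-04"): [{"id": "F1", "price": 380}, {"id": "F2", "price": 520}]}

tools = MockToolSystem()
tools.register("flight_search", lambda origin, date: FLIGHTS.get((origin, date), []))
tools.register("book_flight", lambda flight_id: {"status": "confirmed", "flight_id": flight_id})

print(tools.call("flight_search", origin="SFO", date="2026-03-04"))
print(tools.call("book_flight", flight_id="F1"))
```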

Rubric-Based Reward Module

Constructs per-task evaluation rubrics based on required subgoals, interaction quality metrics, and forbidden behavior constraints to produce dense, interpretable RL reward signals
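
A rubric scorer of this kind might be expressed as weighted checks over the rollout trajectory: credit per completed subgoal, a bonus for good user interaction (e.g. asking before acting), and penalties for forbidden behaviors. The weights, check functions, and trajectory format below are illustrative assumptions rather than the paper's exact formulation.

```python
from typing import Callable, Dict, List

Trajectory = List[Dict[str, str]]      # assumed: list of {"role", "content"} events
Check = Callable[[Trajectory], bool]   # each rubric item is a predicate on the trajectory


def rubric_reward(traj: Trajectory,
                  subgoals: Dict[str, Check],
                  interaction: Dict[str, Check],
                  forbidden: Dict[str, Check],
                  w_sub: float = 1.0, w_int: float = 0.5, w_forb: float = 1.0) -> float:
    """Dense reward: subgoal credit + interaction-quality bonus - forbidden-behavior penalty."""
    sub = sum(w_sub * check(traj) for check in subgoals.values()) / max(len(subgoals), 1)
    inter = sum(w_int * check(traj) for check in interaction.values()) / max(len(interaction), 1)
    penalty = sum(w_forb * check(traj) for check in forbidden.values())
    return sub + inter - penalty


# Toy usage: did the agent ask for the date before booking, and avoid double booking?
traj = [{"role": "agent_ask_user", "content": "Which date?"},
        {"role": "user", "content": "2026-03-04"},
        {"role": "agent_tool_call", "content": "book_flight: F1"}]

reward = rubric_reward(
    traj,
    subgoals={"booked": lambda t: any("book_flight" in e["content"] for e in t)},
    interaction={"asked_first": lambda t: t[0]["role"] == "agent_ask_user"},
    forbidden={"double_booking": lambda t: sum("book_flight" in e["content"] for e in t) > 1},
)
print(reward)  # 1.0 + 0.5 - 0.0 = 1.5
```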

Results

| Benchmark Category | Larger Baseline Models | SYNTHAGENT Small Model | Delta |
|---|---|---|---|
| Math reasoning tasks | Higher-parameter-count baseline | Outperforms | Positive gain |
| Search/retrieval tasks | Higher-parameter-count baseline | Outperforms | Positive gain |
| Tool-use tasks | Higher-parameter-count baseline | Outperforms | Positive gain |
| Aggregate (14 datasets) | Large model baselines | Substantial gains across all | Small > large baselines |

Key Takeaways

  • Practitioners can train capable agentic small LLMs without access to real-world APIs by using fully synthetic mock environments—reducing infrastructure costs and enabling stable large-scale RL training
  • Deliberately underspecifying task instructions during training is an effective technique to teach agents proactive clarification-seeking behavior, which is critical for real-world deployment
  • Rubric-based reward construction tied to subgoals and interaction quality offers a practical template for designing dense, interpretable reward signals in agentic RL settings beyond simple outcome-based rewards

Abstract

Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.
