ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
Problem Statement
Inference scaling has shown strong results for unstructured text generation, but its application to structured outputs like function calls remains largely unexplored. Function calling is a core mechanism for LLM-based agents, and errors in structured outputs are often unrecoverable once a wrong structural decision is made. Existing reward models operate at coarse-grained or outcome levels, failing to provide the step-level supervision needed to guide structured generation effectively.
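The unrecoverability claim can be made concrete with a minimal sketch (illustrative only, not from the paper; the function names are hypothetical): in left-to-right generation, a wrong early structural decision, such as the function name, cannot be repaired by any later tokens.

```python
import json

# Ground-truth call vs. a candidate where the model committed to the wrong
# function name at the first structural decision.
gold = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
bad = {"name": "get_news", "arguments": {"city": "Paris", "unit": "celsius"}}

# Even if every argument is generated perfectly afterwards, the call is wrong:
# autoregressive decoding cannot revisit the committed function name.
assert bad["arguments"] == gold["arguments"]
assert bad["name"] != gold["name"]  # the early structural error persists
print(json.dumps(bad))
```

This is why outcome-level rewards arrive too late: by the time the full call can be judged, the unrecoverable decision has already been made.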
Key Novelty
- ToolPRM: the first process reward model designed specifically for fine-grained intra-call step supervision in structured function calling
- A novel fine-grained intra-call process supervision dataset automatically annotated using function-masking techniques to provide step-level rewards
- Discovery and formalization of the 'explore more but retain less' principle for applying inference scaling to structured outputs, motivated by the unrecoverability of structured generation errors
Evaluation Highlights
- ToolPRM outperforms both coarse-grained and outcome reward models in predictive accuracy for supervising the function calling inference process
- Inference scaling with ToolPRM significantly improves backbone model performance across multiple function calling tasks and benchmarks compared to baselines
Methodology
- Construct a fine-grained intra-call process supervision dataset by decomposing function calls into internal steps and automatically annotating step-level rewards using function-masking techniques
- Train ToolPRM on this dataset to score intermediate steps within a single function call, enabling more granular supervision than coarse-grained or outcome-level reward models
- Apply fine-grained beam search guided by ToolPRM during inference, using the 'explore more but retain less' strategy: expand candidate paths broadly but aggressively prune low-scoring candidates, because structural errors are unrecoverable once committed
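The search procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `expand_step` stands in for the backbone model proposing next-step candidates, and `prm_score` stands in for ToolPRM; both names are hypothetical. "Explore more but retain less" appears as a large expansion width paired with a small retained beam.

```python
from typing import Callable, List, Tuple

def guided_beam_search(
    expand_step: Callable[[List[str]], List[str]],  # proposes next-step candidates
    prm_score: Callable[[List[str]], float],        # step-level reward (ToolPRM stand-in)
    num_steps: int,
    explore_width: int = 16,   # candidates expanded per beam item (wide exploration)
    retain_width: int = 2,     # beams kept after pruning (aggressive retention)
) -> List[str]:
    # Each beam is (cumulative PRM score of the path, list of steps so far).
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(num_steps):
        pool: List[Tuple[float, List[str]]] = []
        for _, prefix in beams:
            # Explore more: generate a wide set of continuations per prefix.
            for cand in expand_step(prefix)[:explore_width]:
                path = prefix + [cand]
                pool.append((prm_score(path), path))
        # Retain less: keep only the top-scoring few. A low-reward structural
        # step cannot be repaired later, so it is dropped immediately.
        pool.sort(key=lambda x: x[0], reverse=True)
        beams = pool[:retain_width]
    return beams[0][1]
```

With a toy expander that proposes `["a", "b"]` at every step and a scorer that rewards `"a"`s, the search returns the all-`"a"` path, showing how the PRM signal steers pruning at each internal step rather than only at the end of the call.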
System Components
A process reward model that assigns step-level scores to internal steps within a single function call, providing fine-grained supervision for structured output generation
A process supervision dataset where function calls are decomposed into steps and automatically annotated with step-level rewards using function-masking techniques
An inference-time search strategy that uses ToolPRM scores to explore multiple candidate structured outputs while aggressively pruning unlikely continuations
A key design principle for inference scaling on structured outputs: explore candidates widely but retain few, pruning aggressively because structural errors in function calls are unrecoverable
Results
| Metric/Benchmark | Baseline | With ToolPRM | Delta |
|---|---|---|---|
| Step-level predictive accuracy | Coarse-grained PRM / outcome RM | Fine-grained ToolPRM | Significant improvement |
| Function calling benchmark performance | Backbone model alone | Backbone + ToolPRM inference scaling | Significant improvement |
| Multi-benchmark function calling tasks | Standard decoding / coarse reward | ToolPRM-guided beam search | Consistent gains across benchmarks |
Key Takeaways
- When applying inference scaling to structured outputs like function calls, use the 'explore more but retain less' strategy: generate a wide beam of candidates but prune aggressively, because structural errors made early in generation are unrecoverable
- Process reward models trained at the intra-call step level provide meaningfully better supervision signals than outcome-level or coarse-grained reward models for function calling tasks, making them worth the investment for production agentic systems
- The function-masking annotation technique offers a scalable, automatic way to construct step-level supervision data for structured outputs without expensive human labeling, which is broadly applicable to other structured generation domains
Abstract
Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scaling primarily focuses on unstructured output generation tasks, leaving its application to structured outputs, like function calling, largely underexplored. To bridge this gap, we propose an inference scaling framework that combines fine-grained beam search with a process reward model, ToolPRM, which scores the internal steps of each single function call. To train ToolPRM, we construct the first fine-grained intra-call process supervision dataset, automatically annotated with function-masking techniques to provide step-level rewards for structured tool-use reasoning. Extensive experiments demonstrate that ToolPRM outperforms coarse-grained and outcome reward models in predictive accuracy, indicating its stronger capability in supervising the function calling inference process. Inference scaling equipped with ToolPRM also significantly improves backbone model performance across various function calling tasks and benchmarks. More importantly, we reveal a key principle for applying inference scaling techniques to structured outputs: "explore more but retain less", owing to the unrecoverable nature of errors in structured function calling generation.