ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
Problem Statement
Inference scaling has shown strong results for unstructured text generation, but its application to structured outputs like function calls remains largely unexplored. Function calling is a core mechanism for LLM-based agents, and errors in structured outputs are often unrecoverable once a wrong structural decision is made. Existing reward models operate at coarse-grained or outcome levels, failing to provide the step-level supervision needed to guide structured generation effectively.
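The unrecoverability claim can be made concrete with a minimal sketch (illustrative only, not from the paper; the function names are hypothetical): in left-to-right generation, a wrong early structural decision, such as the function name, cannot be repaired by any later tokens.

```python
import json

# Ground-truth call vs. a candidate where the model committed to the wrong
# function name at the first structural decision.
gold = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
bad = {"name": "get_news", "arguments": {"city": "Paris", "unit": "celsius"}}

# Even if every argument is generated perfectly afterwards, the call is wrong:
# autoregressive decoding cannot revisit the committed function name.
assert bad["arguments"] == gold["arguments"]
assert bad["name"] != gold["name"]  # the early structural error persists
print(json.dumps(bad))
```

This is why outcome-level rewards arrive too late: by the time the full call can be judged, the unrecoverable decision has already been made.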
Key Novelty
- ToolPRM: the first process reward model designed specifically for fine-grained intra-call step supervision in structured function calling
- A novel fine-grained intra-call process supervision dataset automatically annotated using function-masking techniques to provide step-level rewards
- Discovery and formalization of the 'explore more but retain less' principle for applying inference scaling to structured outputs, motivated by the unrecoverability of structured generation errors
Evaluation Highlights
- ToolPRM outperforms both coarse-grained and outcome reward models in predictive accuracy for supervising the function calling inference process
- Inference scaling with ToolPRM significantly improves backbone model performance across multiple function calling tasks and benchmarks compared to baselines
Methodology
- Construct a fine-grained intra-call process supervision dataset by decomposing function calls into internal steps and automatically annotating step-level rewards using function-masking techniques
- Train ToolPRM on this dataset to score intermediate steps within a single function call, enabling more granular supervision than coarse-grained or outcome-level reward models
- Apply fine-grained beam search guided by ToolPRM during inference, using the 'explore more but retain less' strategy: expand candidate paths broadly but aggressively prune low-scoring candidates, because structural errors are unrecoverable once committed
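The search procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `expand_step` stands in for the backbone model proposing next-step candidates, and `prm_score` stands in for ToolPRM; both names are hypothetical. "Explore more but retain less" appears as a large expansion width paired with a small retained beam.

```python
from typing import Callable, List, Tuple

def guided_beam_search(
    expand_step: Callable[[List[str]], List[str]],  # proposes next-step candidates
    prm_score: Callable[[List[str]], float],        # step-level reward (ToolPRM stand-in)
    num_steps: int,
    explore_width: int = 16,   # candidates expanded per beam item (wide exploration)
    retain_width: int = 2,     # beams kept after pruning (aggressive retention)
) -> List[str]:
    # Each beam is (cumulative PRM score of the path, list of steps so far).
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(num_steps):
        pool: List[Tuple[float, List[str]]] = []
        for _, prefix in beams:
            # Explore more: generate a wide set of continuations per prefix.
            for cand in expand_step(prefix)[:explore_width]:
                path = prefix + [cand]
                pool.append((prm_score(path), path))
        # Retain less: keep only the top-scoring few. A low-reward structural
        # step cannot be repaired later, so it is dropped immediately.
        pool.sort(key=lambda x: x[0], reverse=True)
        beams = pool[:retain_width]
    return beams[0][1]
```

With a toy expander that proposes `["a", "b"]` at every step and a scorer that rewards `"a"`s, the search returns the all-`"a"` path, showing how the PRM signal steers pruning at each internal step rather than only at the end of the call.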
System Components
A process reward model that assigns step-level scores to internal steps within a single function call, providing fine-grained supervision for structured output generation
A process supervision dataset where function calls are decomposed into steps and automatically annotated with step-level rewards using function-masking techniques
An inference-time search strategy that uses ToolPRM scores to explore multiple candidate structured outputs while aggressively pruning unlikely continuations
A key design principle for inference scaling on structured outputs: explore candidates widely but retain few, pruning aggressively because structural errors in function calls are unrecoverable
Results
| Metric/Benchmark | Baseline | With ToolPRM | Delta |
|---|---|---|---|
| Step-level predictive accuracy | Coarse-grained PRM / outcome RM | Fine-grained ToolPRM | Significant improvement |
| Function calling benchmark performance | Backbone model alone | Backbone + ToolPRM inference scaling | Significant improvement |
| Multi-benchmark function calling tasks | Standard decoding / coarse reward | ToolPRM-guided beam search | Consistent gains across benchmarks |
Key Takeaways
- When applying inference scaling to structured outputs like function calls, use the 'explore more but retain less' strategy: generate a wide beam of candidates but prune aggressively, because structural errors made early in generation are unrecoverable
- Process reward models trained at the intra-call step level provide meaningfully better supervision signals than outcome-level or coarse-grained reward models for function calling tasks, making them worth the investment for production agentic systems
- The function-masking annotation technique offers a scalable, automatic way to construct step-level supervision data for structured outputs without expensive human labeling, which is broadly applicable to other structured generation domains
Abstract
Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scaling primarily focuses on unstructured output generation tasks, leaving its application to structured outputs, like function calling, largely underexplored. To bridge this gap, we propose an inference scaling framework that combines fine-grained beam search with a process reward model, ToolPRM, which scores the internal steps of each single function call. To train ToolPRM, we construct the first fine-grained intra-call process supervision dataset, automatically annotated with function-masking techniques to provide step-level rewards for structured tool-use reasoning. Extensive experiments demonstrate that ToolPRM outperforms coarse-grained and outcome reward models in predictive accuracy, indicating its stronger capability in supervising the function calling inference process. Inference scaling equipped with ToolPRM also significantly improves backbone model performance across various function calling tasks and benchmarks. More importantly, we reveal a key principle for applying inference scaling techniques to structured outputs: "explore more but retain less", owing to the unrecoverable nature of errors in structured function calling generation.