
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

Jianghao Lin, Yuanyuan Shi, Xing Peng, Renjie Ding, Hairui Wang, Yuxuan Peng, Bizhe Bai, Wei Song, Fengshuo Bai, Huacan Chai, Weinan Zhang, Fei Huang, Ying Wen
arXiv.org | 2025
ToolPRM is a process reward model that enables fine-grained inference scaling for structured function calling outputs by scoring internal steps within individual function calls, guided by the principle of 'explore more but retain less'.

Problem Statement

Inference scaling has shown strong results for unstructured text generation, but its application to structured outputs like function calls remains largely unexplored. Function calling is a core mechanism for LLM-based agents, and errors in structured outputs are often unrecoverable once a wrong structural decision is made. Existing reward models operate at coarse-grained or outcome levels, failing to provide the step-level supervision needed to guide structured generation effectively.

Key Novelty

  • ToolPRM: the first process reward model designed specifically for fine-grained intra-call step supervision in structured function calling
  • A novel fine-grained intra-call process supervision dataset automatically annotated using function-masking techniques to provide step-level rewards
  • Discovery and formalization of the 'explore more but retain less' principle for applying inference scaling to structured outputs, motivated by the unrecoverability of structured generation errors

Evaluation Highlights

  • ToolPRM outperforms both coarse-grained and outcome reward models in predictive accuracy for supervising the function calling inference process
  • Inference scaling with ToolPRM significantly improves backbone model performance across multiple function calling tasks and benchmarks compared to baselines

Breakthrough Assessment

7/10: The paper makes a significant and practical advance by extending inference scaling to structured outputs—an underexplored but critically important domain for agentic AI—while also contributing a principled insight ('explore more but retain less') and a novel automated data construction method that can generalize beyond this specific task.

Methodology

  1. Construct a fine-grained intra-call process supervision dataset by decomposing function calls into internal steps and automatically annotating step-level rewards using function-masking techniques
  2. Train ToolPRM on this dataset to score intermediate steps within a single function call, enabling more granular supervision than coarse-grained or outcome-level reward models
  3. Apply fine-grained beam search guided by ToolPRM during inference, using the 'explore more but retain less' strategy: expand candidate paths broadly at each step, but aggressively prune low-scoring partial outputs, since structural errors cannot be recovered once made
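The search in step 3 can be sketched as follows. This is a minimal toy illustration of 'explore more but retain less' (many candidates proposed per beam, few retained), not the paper's implementation; `ToyModel`, `ToyPRM`, and the character-level step granularity are assumptions for demonstration.

```python
import heapq
from dataclasses import dataclass

# Toy stand-ins for the backbone model and ToolPRM; the real system
# proposes token/step continuations and scores them with a trained PRM.
@dataclass
class ToyModel:
    target: str = "get_weather(city='Paris')"

    def propose_steps(self, steps, n):
        """Propose up to n candidate next steps (here: characters)."""
        pos = len("".join(steps))
        if pos >= len(self.target):
            return []  # call already fully generated
        correct = self.target[pos]
        wrong = [c for c in "abcxyz()'=_" if c != correct]
        return ([correct] + wrong)[:n]

    def is_complete(self, steps):
        return "".join(steps) == self.target

@dataclass
class ToyPRM:
    target: str = "get_weather(city='Paris')"

    def score_step(self, steps, step):
        """Step-level reward: +1 if the partial call is still valid."""
        prefix = "".join(steps) + step
        return 1.0 if self.target.startswith(prefix) else -1.0

def prm_guided_beam_search(model, prm, expand_k=8, retain_k=2, max_steps=64):
    """'Explore more but retain less': expand_k >> retain_k, because a
    wrong structural decision (e.g. a bad function name) is unrecoverable,
    so bad partial calls are pruned aggressively."""
    beams = [([], 0.0)]  # (steps so far, cumulative PRM score)
    for _ in range(max_steps):
        if all(model.is_complete(s) for s, _ in beams):
            break
        candidates = []
        for steps, score in beams:
            proposals = model.propose_steps(steps, n=expand_k)
            if not proposals:  # finished beam carries over unchanged
                candidates.append((steps, score))
                continue
            for step in proposals:  # explore more
                candidates.append((steps + [step],
                                   score + prm.score_step(steps, step)))
        # Retain less: keep only the top retain_k partial calls.
        beams = heapq.nlargest(retain_k, candidates, key=lambda c: c[1])
    return "".join(max(beams, key=lambda b: b[1])[0])
```

With a PRM that reliably penalizes invalid prefixes, the search recovers the correct call even though most proposed continuations are wrong, which is the practical payoff of intra-call step scoring.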

System Components

ToolPRM

A process reward model that assigns step-level scores to internal steps within a single function call, providing fine-grained supervision for structured output generation

Fine-Grained Intra-Call Dataset

A process supervision dataset where function calls are decomposed into steps and automatically annotated with step-level rewards using function-masking techniques
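The summary does not spell out the function-masking procedure, but the key property of the resulting labels can be sketched as follows: a call is decomposed into intra-call steps (function name, then each argument), and once a step diverges from the reference call, every later step is labeled incorrect, mirroring the unrecoverability of structural errors. The step granularity and the binary labeling scheme below are illustrative assumptions, not the paper's exact annotation pipeline.

```python
def decompose_call(call: dict):
    """Split a function call into intra-call steps:
    the function name first, then each argument in order.
    (Assumed granularity; the paper may use finer steps.)"""
    steps = [("__fn__", call["name"])]
    for key, value in call["arguments"].items():
        steps.append((key, value))
    return steps

def annotate_step_rewards(pred: dict, gold: dict):
    """Binary step-level rewards for a predicted call vs. the reference.

    A step earns reward 1 only if it matches the reference AND all
    earlier steps matched: after the first divergence, generation
    cannot recover, so all subsequent steps are labeled 0.
    """
    gold_steps = dict(decompose_call(gold))
    rewards, alive = [], True
    for key, value in decompose_call(pred):
        ok = alive and gold_steps.get(key) == value
        rewards.append((key, int(ok)))
        alive = ok
    return rewards
```

For example, a prediction with the right function name but a wrong first argument yields rewards like `[('__fn__', 1), ('city', 0), ('unit', 0)]`: the trailing zeros encode that the call is already beyond repair.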

Fine-Grained Beam Search

An inference-time search strategy that uses ToolPRM scores to explore multiple candidate structured outputs while aggressively pruning unlikely continuations

Explore More but Retain Less Principle

A key design principle for inference scaling on structured outputs that advocates wide exploration of candidates but aggressive retention/pruning due to the unrecoverability of structural errors in function calls

Results

| Metric/Benchmark | Baseline (Coarse/ORM) | ToolPRM (Fine-Grained PRM) | Delta |
|---|---|---|---|
| Step-level predictive accuracy | Lower (coarse-grained/ORM) | Higher | Significant improvement |
| Function calling benchmark performance | Backbone model baseline | Substantially improved with inference scaling | Significant improvement |
| Multi-benchmark function calling tasks | Standard decoding / coarse reward | ToolPRM + beam search | Consistent gains across benchmarks |

Key Takeaways

  • When applying inference scaling to structured outputs like function calls, use the 'explore more but retain less' strategy—generate a wide beam of candidates but aggressively prune because structural errors early in generation are unrecoverable
  • Process reward models trained at the intra-call step level provide meaningfully better supervision signals than outcome-level or coarse-grained reward models for function calling tasks, making them worth the investment for production agentic systems
  • The function-masking annotation technique offers a scalable, automatic way to construct step-level supervision data for structured outputs without expensive human labeling, which is broadly applicable to other structured generation domains

Abstract

Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scaling primarily focuses on unstructured output generation tasks, leaving its application in structured outputs, like function calling, largely underexplored. To bridge this gap, we propose an inference scaling framework that combines fine-grained beam search with a process reward model, ToolPRM, which scores the internal steps of each single function call. To train ToolPRM, we construct the first fine-grained intra-call process supervision dataset, automatically annotated with function-masking techniques to provide step-level rewards for structured tool-use reasoning. Extensive experiments demonstrate that ToolPRM beats the coarse-grained and outcome reward models in terms of predictive accuracy, indicating its stronger capability in supervising the function calling inference process. The inference scaling technique equipped with ToolPRM also significantly improves the backbone model performance across various function calling tasks and benchmarks. More importantly, we reveal a key principle for applying inference scaling techniques to structured outputs: "explore more but retain less", due to the unrecoverability characteristics of structured function calling generation.

Generated on 2026-03-02 using Claude