Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting
Problem Statement
Tool-augmented LLM dialogue agents in persona-grounded settings suffer from two key failure modes: over-speaking (excessively long in-character responses) and under-acting (poor tool use, including hallucinated or unnecessary function calls). Existing prompting strategies lack the structured behavioral constraints needed to reliably align agent output with both persona requirements and tool-use protocols. This work addresses the gap between elaborate automated optimization methods and the practical need for reliable, interpretable prompting solutions in competitive dialogue system settings.
Key Novelty
- Character-card/scene-contract design: a structured prompt format that explicitly defines persona attributes and behavioral rules as a 'contract' the agent must follow during dialogue
- Strict function-calling enforcement: explicit prompt-level rules that constrain when and how tools may be invoked, reducing hallucinated and unnecessary API calls
- Open-sourced APO tool and best-performing prompts for the CPDC 2025 API track, enabling reproducibility and community reuse
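The character-card/scene-contract idea can be sketched as a small prompt builder. Field names, rule wording, and the persona below are illustrative assumptions, not the paper's exact format:

```python
# Hypothetical sketch of a character-card/scene-contract prompt builder.
# Field names and rule wording are illustrative, not the authors' exact format.

def build_system_prompt(card: dict, contract: list[str]) -> str:
    """Render a persona 'character card' plus a behavioral 'scene contract'."""
    lines = ["## Character Card"]
    for key, value in card.items():
        lines.append(f"- {key}: {value}")
    lines.append("## Scene Contract (rules the agent must follow)")
    for i, rule in enumerate(contract, 1):
        lines.append(f"{i}. {rule}")
    return "\n".join(lines)

prompt = build_system_prompt(
    card={"name": "Mira", "role": "innkeeper", "speech": "brief, warm"},
    contract=[
        "Reply in at most two sentences (avoid over-speaking).",
        "Call a tool only when the user's request requires external data.",
        "Never invent function names outside the provided tool list.",
    ],
)
print(prompt)
```

The point of the "contract" framing is that persona attributes and behavioral limits live in one explicit, inspectable block rather than being scattered through free-form instructions.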
Evaluation Highlights
- Rule-based role prompting (RRP) achieved an overall score of 0.571 on the CPDC 2025 API track, compared to a zero-shot baseline of 0.519 (+0.052 absolute improvement)
- RRP outperformed all other tested approaches including basic role prompting, improved role prompting, and automatic prompt optimization (APO), demonstrating that structured rule-based constraints are more effective than automated optimization for this task
Methodology
- Identified two core failure modes in tool-augmented role-play agents (over-speaking and under-acting) through analysis of the CPDC 2025 API track task and baseline system outputs
- Designed and evaluated four prompting strategies in sequence: basic role prompting → improved role prompting → automatic prompt optimization (APO) → rule-based role prompting (RRP), iterating based on observed failure patterns
- Developed the RRP approach using character-card/scene-contract prompt structures and explicit function-calling rules, then evaluated all approaches on the CPDC 2025 benchmark to compare overall dialogue agent scores
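The four-strategy comparison above can be sketched as a simple evaluation harness. `run_agent`, `score_dialogue`, and the toy scores are stand-ins for the CPDC evaluation; only the ordering (RRP best, at 0.571) comes from the paper:

```python
# Illustrative harness for comparing the four prompting strategies.
# run_agent and score_dialogue are assumed stand-ins for the CPDC pipeline.

STRATEGIES = ["basic_role", "improved_role", "apo", "rule_based_role"]

def compare(strategies, run_agent, score_dialogue):
    """Score each strategy and return all scores plus the best one."""
    scores = {s: score_dialogue(run_agent(s)) for s in strategies}
    best = max(scores, key=scores.get)
    return scores, best

# Toy scores for demonstration only; intermediate values are invented,
# only rule_based_role's 0.571 and its top rank match the paper.
toy = {"basic_role": 0.53, "improved_role": 0.55,
       "apo": 0.56, "rule_based_role": 0.571}
scores, best = compare(STRATEGIES, run_agent=lambda s: s, score_dialogue=toy.get)
print(best)
```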
System Components
Character card / scene contract: a structured prompt template that encodes persona attributes, behavioral expectations, and situational rules as an explicit 'contract,' ensuring the agent consistently embodies the role and respects dialogue context
Strict function-calling rules: prompt-level instructions that define the precise conditions under which tools should or should not be called, preventing hallucinated function names and unnecessary pre-answer tool invocations
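The paper enforces these rules at the prompt level; a programmatic guard is an assumed complement, not the authors' implementation. The tool names below are hypothetical:

```python
# Illustrative guard against hallucinated or redundant tool calls.
# The paper's enforcement is prompt-level; this check is an assumed
# complement, and the tool names are hypothetical.

ALLOWED_TOOLS = {"get_menu", "check_room_availability"}

def validate_call(name: str, called_so_far: set[str]) -> tuple[bool, str]:
    """Reject undeclared tool names and repeated calls to the same tool."""
    if name not in ALLOWED_TOOLS:
        return False, f"hallucinated function: {name!r} is not a declared tool"
    if name in called_so_far:
        return False, f"unnecessary repeat call to {name!r}"
    return True, "ok"

ok, reason = validate_call("book_flight", set())
print(ok, reason)
```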
APO tool: an automated system for iteratively refining prompts based on performance feedback, open-sourced for community use despite being outperformed by RRP in this setting
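The iterative refinement loop behind APO can be sketched as follows; `score_fn` and `propose_edit` stand in for an evaluator and an LLM-based prompt rewriter, and neither reflects the open-sourced tool's actual API:

```python
# Minimal sketch of an automatic prompt optimization (APO) loop.
# score_fn and propose_edit are assumed stand-ins for an evaluator and
# an LLM-based prompt editor; this is not the released tool's interface.

def optimize(prompt: str, score_fn, propose_edit, rounds: int = 3) -> str:
    """Greedy hill-climbing: keep a candidate edit only if it scores higher."""
    best_prompt, best_score = prompt, score_fn(prompt)
    for _ in range(rounds):
        candidate = propose_edit(best_prompt)
        score = score_fn(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt

# Toy demo: the score favors shorter prompts; the "editor" trims the last word.
result = optimize(
    "You are a helpful concise innkeeper agent",
    score_fn=lambda p: -len(p),
    propose_edit=lambda p: p.rsplit(" ", 1)[0],
)
print(result)
```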
Role-prompting baselines: a progression of strategies (basic → improved) used as baselines and stepping stones toward the final RRP design, providing ablation-style insight into which prompt elements matter most
Results
| Approach | Overall Score |
|---|---|
| Zero-shot baseline | 0.519 |
| Basic role prompting | > 0.519, < 0.571 (exact score not reported) |
| Improved role prompting | > 0.519, < 0.571 (exact score not reported) |
| Automatic prompt optimization (APO) | > 0.519, < 0.571 (exact score not reported) |
| Rule-based role prompting (RRP, this paper) | 0.571 (+0.052 over baseline) |
Key Takeaways
- Explicit, rule-based prompt structures (character cards + scene contracts + strict tool-use rules) are more reliable than automated prompt optimization for constraining LLM behavior in structured role-play and tool-use settings
- Over-speaking and under-acting are distinct, addressable failure modes in persona-grounded dialogue agents—targeted prompt constraints for each issue yield measurable performance gains without complex training or fine-tuning
- Open-sourcing optimized prompts and APO tooling enables practitioners to bootstrap persona prompt development for similar dialogue agent tasks, reducing iteration cost in competition and production settings
Abstract
This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in-character responses (over-speaking) while failing to use tools effectively according to the persona (under-acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) improved role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting. The rule-based role prompting (RRP) approach achieved the best performance through two novel techniques, character-card/scene-contract design and strict enforcement of function calling, which led to an overall score of 0.571, improving on the zero-shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role-playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open-sourcing all of our best-performing prompts and the APO tool. Source code is available at https://github.com/scb-10x/apo