OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
Problem Statement
Existing multi-agent LLM systems rely on rigid, predefined collaboration structures or on simple aggregation methods such as majority voting and round-table debates, which can suppress correct answers held by a minority of agents. Prior graph-based approaches optimize agent performance in isolation, neglecting the quality of inter-agent communication. As a result, the collective reasoning potential of multi-agent systems is underutilized due to poor debate and interaction quality.
Key Novelty
- Verbal Reinforcement Learning applied to multi-agent systems: defines an action space and feedback mechanism that explicitly evaluates and optimizes the quality of agent-to-agent communication, not just final outputs
- Dynamic collaboration structure construction: the system adaptively builds and refines the multi-agent interaction graph based on communication robustness and coherence signals during debate
- Holistic debate quality metric: introduces a feedback mechanism that measures both communication robustness and coherence throughout the debate process, enabling principled optimization of interaction patterns
Evaluation Highlights
- OPTAGENT significantly outperforms single-agent prompting baselines (e.g., Chain-of-Thought, self-consistency) across mathematical reasoning, creative writing, scientific reasoning, and numerical sorting tasks
- OPTAGENT surpasses state-of-the-art multi-agent frameworks (including graph-based and debate-based methods) on diverse reasoning benchmarks, demonstrating generalization beyond narrow task types
Methodology
- Step 1 – Dynamic Graph Construction: Model the multi-agent system as a graph where nodes are LLM agents and edges represent communication channels; dynamically construct and update the graph topology based on debate progress and communication quality signals
- Step 2 – Verbal Reinforcement Learning Loop: Define an action space over agent responses and communication acts; apply a verbal RL algorithm where agents receive feedback rewards based on the robustness and coherence of their contributions to the debate, refining their communication strategies iteratively
- Step 3 – Aggregated Decision Making: After the reinforced debate concludes, aggregate all agent outputs via majority voting to produce the final answer, ensuring that refined, high-quality contributions from minority agents are preserved in the voting pool
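The three steps above can be sketched as a single loop. This is a minimal illustration, not the paper's implementation: `llm_agent` and `score_exchange` are hypothetical stand-ins for the LLM calls and the robustness/coherence evaluator, and the pruning threshold and equal score weighting are assumptions.

```python
import random
from collections import Counter

def llm_agent(agent_id, question, peer_messages):
    """Hypothetical stand-in for an LLM call: returns an answer with a rationale.
    A real system would prompt an LLM with the question and the peers' messages."""
    return {"agent": agent_id,
            "answer": random.choice(["A", "B", "C"]),
            "rationale": f"agent {agent_id}'s argument"}

def score_exchange(message):
    """Hypothetical evaluator returning (robustness, coherence) scores in [0, 1].
    A real system would use an LLM judge probing the argument under challenge."""
    return random.random(), random.random()

def optagent_debate(question, n_agents=4, n_rounds=3, prune_below=0.3):
    """Feedback-driven debate over a dynamic communication graph,
    finished by a majority vote over all agents."""
    # Step 1: start from a fully connected directed communication graph.
    edges = {(i, j) for i in range(n_agents) for j in range(n_agents) if i != j}
    messages = {i: None for i in range(n_agents)}
    for _ in range(n_rounds):
        new_messages = {}
        for i in range(n_agents):
            # Each agent only sees peers it is still connected to.
            peers = [messages[j] for j in range(n_agents)
                     if (j, i) in edges and messages[j] is not None]
            new_messages[i] = llm_agent(i, question, peers)
        messages = new_messages
        # Step 2: verbal feedback on each channel; prune low-quality ones.
        for (i, j) in list(edges):
            robustness, coherence = score_exchange(messages[i])
            if 0.5 * (robustness + coherence) < prune_below:
                edges.discard((i, j))
    # Step 3: majority vote over all agents' final answers.
    votes = Counter(m["answer"] for m in messages.values())
    return votes.most_common(1)[0][0]

print(optagent_debate("Which option is correct?"))
```

Note that pruning only removes channels, never agents: every agent's final answer still enters the vote, which is how minority contributions survive to the aggregation step.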
System Components
A graph network representing agents as nodes and their communication pathways as edges; topology is adaptively refined during the debate rather than being fixed or predefined
A structured set of communicative actions agents can take during debate (e.g., assert, challenge, revise), enabling reinforcement learning signals to be applied directly to language-level interactions
An evaluator that scores inter-agent exchanges on robustness (logical consistency under challenge) and coherence (alignment and clarity of arguments) to generate RL reward signals
Final decision module that collects all agent outputs post-debate and selects the answer by majority vote, benefiting from improved individual contributions due to RL-optimized communication
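The action space and reward components above might be represented as follows. The action names mirror the examples in the text (assert, challenge, revise), but the `Utterance` fields, the `reward` function, and its equal weighting of the two evaluator scores are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Act(Enum):
    ASSERT = "assert"        # put forward an answer with a rationale
    CHALLENGE = "challenge"  # probe a peer's reasoning for weaknesses
    REVISE = "revise"        # update one's own position after feedback

@dataclass
class Utterance:
    agent_id: int
    act: Act
    target: Optional[int]  # the peer being challenged, if any
    text: str

def reward(robustness: float, coherence: float,
           w_robust: float = 0.5, w_coherent: float = 0.5) -> float:
    """Scalar feedback signal combining the two evaluator scores.
    The equal default weighting is an assumption, not a detail from the paper."""
    return w_robust * robustness + w_coherent * coherence

# Example: a challenge utterance and the feedback it might receive.
u = Utterance(agent_id=1, act=Act.CHALLENGE, target=0,
              text="Your step 2 divides by zero when n = 0.")
print(reward(robustness=0.8, coherence=0.6))
```

Attaching the reward to individual utterances rather than to final answers is what lets the feedback act on communication quality directly.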
Results
| Task Domain | Best Baseline (Multi-Agent) | OPTAGENT | Delta |
|---|---|---|---|
| Mathematical Reasoning | SOTA multi-agent framework | Significantly higher accuracy | Positive improvement |
| Scientific Reasoning | SOTA multi-agent framework | Significantly higher accuracy | Positive improvement |
| Creative Writing | SOTA multi-agent framework | Significantly higher quality | Positive improvement |
| Numerical Sorting | SOTA multi-agent framework | Significantly higher accuracy | Positive improvement |
Key Takeaways
- Optimizing communication quality — not just individual agent outputs — is a critical and underexplored lever for improving multi-agent LLM reasoning; practitioners should consider interaction-level rewards when designing multi-agent pipelines
- Verbal reinforcement learning provides a practical mechanism to make LLM agent debate more principled, replacing ad-hoc round-table or voting schemes with dynamically adapted, quality-driven collaboration structures
- The framework's effectiveness across diverse task types (math, science, creative writing, sorting) suggests that communication-aware multi-agent optimization generalizes well, making OPTAGENT a viable drop-in enhancement for existing multi-agent LLM applications
Abstract
Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debate quality plays a significant role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is reached through a majority vote over all agents. We assess OPTAGENT on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.