OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang
IJCNLP-AACL | 2025
OPTAGENT introduces a verbal reinforcement learning framework that dynamically constructs and refines multi-agent LLM collaboration structures by optimizing the quality of inter-agent communication and debate, rather than just individual agent performance.

Problem Statement

Existing multi-agent LLM systems rely on rigid, predefined collaboration structures or simple aggregation methods like majority voting and round-table debates, which can suppress correct but minority agent contributions. Prior graph-based approaches optimize agent performance in isolation, neglecting the quality of inter-agent communication. This gap means that the collective reasoning potential of multi-agent systems is underutilized due to poor debate and interaction quality.

Key Novelty

  • Verbal Reinforcement Learning applied to multi-agent systems: defines an action space and feedback mechanism that explicitly evaluates and optimizes the quality of agent-to-agent communication, not just final outputs
  • Dynamic collaboration structure construction: the system adaptively builds and refines the multi-agent interaction graph based on communication robustness and coherence signals during debate
  • Holistic debate quality metric: introduces a feedback mechanism that measures both communication robustness and coherence throughout the debate process, enabling principled optimization of interaction patterns

Evaluation Highlights

  • OPTAGENT significantly outperforms single-agent prompting baselines (e.g., Chain-of-Thought, self-consistency) across mathematical reasoning, creative writing, scientific reasoning, and numerical sorting tasks
  • OPTAGENT surpasses state-of-the-art multi-agent frameworks (including graph-based and debate-based methods) on diverse reasoning benchmarks, demonstrating generalization beyond narrow task types

Breakthrough Assessment

6/10 The paper makes a solid and well-motivated contribution by shifting the optimization target in multi-agent LLM systems from individual agent accuracy to inter-agent communication quality via verbal RL, but the core components (RL feedback loops, graph-based agent structures, majority voting) are incremental extensions of existing ideas rather than a paradigm shift.

Methodology

  1. Step 1 – Dynamic Graph Construction: Model the multi-agent system as a graph where nodes are LLM agents and edges represent communication channels; dynamically construct and update the graph topology based on debate progress and communication quality signals
  2. Step 2 – Verbal Reinforcement Learning Loop: Define an action space over agent responses and communication acts; apply a verbal RL algorithm where agents receive feedback rewards based on the robustness and coherence of their contributions to the debate, refining their communication strategies iteratively
  3. Step 3 – Aggregated Decision Making: After the reinforced debate concludes, aggregate all agent outputs via majority voting to produce the final answer, ensuring that refined, high-quality contributions from minority agents are preserved in the voting pool
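The three steps above can be sketched as one loop. This is a minimal skeleton under stated assumptions, not the paper's implementation: `debate_fn` stands in for an LLM agent producing a verbal action, `score_fn` for the robustness/coherence evaluator, and the graph routing of Step 1 is omitted for brevity.

```python
def optagent_debate(initial_answers, debate_fn, score_fn, rounds=2, accept=0.5):
    """Skeleton of the OPTAGENT-style debate loop (illustrative only).

    initial_answers: dict mapping agent name -> initial answer string.
    debate_fn(name, answers): returns the agent's next verbal turn.
    score_fn(turn): stand-in evaluator returning a reward in [0, 1].
    """
    answers = dict(initial_answers)
    for _ in range(rounds):
        for name in answers:
            turn = debate_fn(name, answers)   # Step 2: verbal action in debate
            if score_fn(turn) >= accept:      # Step 2: feedback reward gates update
                answers[name] = turn          # refine only high-quality turns
    votes = list(answers.values())            # Step 3: aggregate by majority vote
    return max(set(votes), key=votes.count)
```

The `accept` threshold is a hypothetical simplification of the paper's feedback mechanism; in practice the reward would shape how agents revise their communication strategies, not just whether an update is kept.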

System Components

Dynamic Collaboration Graph

A graph network representing agents as nodes and their communication pathways as edges; topology is adaptively refined during the debate rather than being fixed or predefined
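A minimal sketch of such a graph, assuming agents as string ids and a per-edge communication-quality weight (the class name, fields, and pruning rule are illustrative assumptions, not the paper's data structures):

```python
from collections import defaultdict

class CollaborationGraph:
    """Hypothetical dynamic agent graph: nodes are agent ids, directed
    edges are communication channels weighted by debate quality."""

    def __init__(self, agent_ids):
        self.agents = list(agent_ids)
        self.edges = defaultdict(float)  # (src, dst) -> communication quality

    def connect(self, src, dst, quality=1.0):
        self.edges[(src, dst)] = quality

    def refine(self, threshold=0.5):
        # Prune low-quality channels so later debate rounds route
        # messages only along robust, coherent pathways.
        self.edges = defaultdict(float, {
            e: q for e, q in self.edges.items() if q >= threshold
        })

    def neighbors(self, src):
        return [dst for (s, dst) in self.edges if s == src]
```

Calling `refine` after each debate round is one simple way to realize "adaptively refined" topology: channels whose exchanges scored poorly simply disappear from the graph.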

Verbal RL Action Space

A structured set of communicative actions agents can take during debate (e.g., assert, challenge, revise), enabling reinforcement learning signals to be applied directly to language-level interactions
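A toy encoding of such an action space, using the three example actions named above; the exact action set and turn format in the paper may differ, so treat this as a sketch:

```python
from enum import Enum

class DebateAction(Enum):
    """Illustrative communicative actions for a debate turn."""
    ASSERT = "assert"        # state a position with supporting reasoning
    CHALLENGE = "challenge"  # question another agent's claim
    REVISE = "revise"        # update one's own answer after feedback

def format_turn(agent_id: str, action: DebateAction, content: str) -> str:
    # Serialize a debate turn so a downstream evaluator can score it.
    return f"[{agent_id}|{action.value}] {content}"
```

Making the action explicit in each turn is what lets reinforcement signals attach to language-level behavior (e.g., rewarding a well-grounded CHALLENGE) rather than only to final answers.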

Communication Feedback Mechanism

An evaluator that scores inter-agent exchanges on robustness (logical consistency under challenge) and coherence (alignment and clarity of arguments) to generate RL reward signals
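Assuming the evaluator emits two scores in [0, 1], one simple way to combine them into a scalar RL reward is a weighted average; the weighting below is an illustrative assumption, not the paper's formula:

```python
def debate_reward(robustness: float, coherence: float,
                  w_robust: float = 0.5) -> float:
    """Combine evaluator scores (each in [0, 1]) into one reward signal.

    robustness: logical consistency of the exchange under challenge.
    coherence: alignment and clarity of the arguments.
    w_robust: hypothetical weight trading off the two criteria.
    """
    if not (0.0 <= robustness <= 1.0 and 0.0 <= coherence <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return w_robust * robustness + (1.0 - w_robust) * coherence
```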

Majority Vote Aggregator

Final decision module that collects all agent outputs post-debate and selects the answer by majority vote, benefiting from improved individual contributions due to RL-optimized communication
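The aggregation step itself is straightforward; a minimal version, with ties broken by first appearance (one simple convention among several, not necessarily the paper's):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; on ties, the earliest-seen answer
    among those with the maximum count wins."""
    counts = Counter(answers)
    best = max(counts.values())
    for a in answers:  # first answer reaching the max count wins
        if counts[a] == best:
            return a
```

The framework's premise is that this simple aggregator works well precisely because the preceding RL-optimized debate has raised the quality of every vote in the pool.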

Results

| Task Domain            | Best Baseline (Multi-Agent) | OPTAGENT                    | Delta                |
|------------------------|-----------------------------|-----------------------------|----------------------|
| Mathematical Reasoning | SOTA multi-agent framework  | Significantly higher accuracy | Positive improvement |
| Scientific Reasoning   | SOTA multi-agent framework  | Significantly higher accuracy | Positive improvement |
| Creative Writing       | SOTA multi-agent framework  | Significantly higher quality  | Positive improvement |
| Numerical Sorting      | SOTA multi-agent framework  | Significantly higher accuracy | Positive improvement |

Key Takeaways

  • Optimizing communication quality — not just individual agent outputs — is a critical and underexplored lever for improving multi-agent LLM reasoning; practitioners should consider interaction-level rewards when designing multi-agent pipelines
  • Verbal reinforcement learning provides a practical mechanism to make LLM agent debate more principled, replacing ad-hoc round-table or voting schemes with dynamically adapted, quality-driven collaboration structures
  • The framework's effectiveness across diverse task types (math, science, creative writing, sorting) suggests that communication-aware multi-agent optimization generalizes well, making OPTAGENT a viable drop-in enhancement for existing multi-agent LLM applications

Abstract

Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debating quality plays a significant role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess OPTAGENT on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.

Generated on 2026-04-01 using Claude