
A Dynamic Agent Framework for Large Language Model Reasoning for Medical Visual Question Answering

Ziyan Xiao, Ruiyang Zhang, Yushi Feng, Lingting Zhu, Liang Peng, Lequan Yu
2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
MedDAF is a reward-based Dynamic Agent Framework that disentangles medical Visual Question Answering into three modular reasoning components, enabling adaptive agent selection and iterative refinement to improve zero-shot LLM performance on medical VQA benchmarks.

Problem Statement

Medical VQA requires simultaneous question comprehension, knowledge retrieval, and inference, making it hard to isolate and optimize individual reasoning steps. Static prompt paradigms fail to leverage the diversity of reasoning strategies LLMs can employ for different question types. Existing benchmarks also lack fine-grained metrics to evaluate context-specific reasoning patterns beyond overall accuracy.

Key Novelty

  • Explicit disentanglement of the medical VQA reasoning pipeline into three distinct components: a question-agent latent space module, a planning module, and a judgment/inference module
  • Reward-based dynamic agent selection that adaptively chooses the optimal reasoning paradigm per question, moving beyond static prompting
  • Temporal iterative framework that accumulates experience in a latent space and refines agent selection over time for improved efficiency and accuracy

Evaluation Highlights

  • Improved zero-shot accuracy on the VQA-RAD medical VQA benchmark compared to static prompt baselines
  • Improved zero-shot accuracy on the SLAKE multilingual medical VQA benchmark, demonstrating cross-dataset generalization

Breakthrough Assessment

5/10. MedDAF presents a solid and well-structured contribution to medical VQA by systematically decomposing reasoning and introducing dynamic agent selection, but the core ideas of modular agent pipelines and reward-based planning are incremental extensions of existing agentic LLM frameworks rather than a paradigm shift.

Methodology

  1. Encode incoming medical VQA questions into a latent space module that stores prior agent experiences and question-type representations to inform future agent selection
  2. Use a planning module to score and select the optimal reasoning agent/paradigm from a pool of candidates based on the question context and accumulated latent experience
  3. Execute the selected agent to generate a candidate answer, apply reward scoring in the judgment module, and iteratively refine until a final answer is produced
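The three steps above can be sketched as a minimal select-execute-judge loop. This is an illustrative reduction, not the paper's implementation: the agent pool, the keyword-based question encoder, the fixed `judge` score, and the `accept_threshold` are all stand-ins for components the paper realizes with LLMs.

```python
import random
from collections import defaultdict

# Hypothetical pool of reasoning agents: each maps a question to an answer.
AGENTS = {
    "chain_of_thought": lambda q: f"step-by-step answer to {q!r}",
    "knowledge_retrieval": lambda q: f"retrieval-grounded answer to {q!r}",
    "direct": lambda q: f"direct answer to {q!r}",
}

# Question-agent latent space stand-in: running mean reward per
# (question type, agent) pair, accumulated across questions.
experience = defaultdict(lambda: {"reward": 0.0, "count": 0})

def question_type(question):
    # Stand-in encoder: bucket questions with a crude keyword heuristic.
    return "closed" if question.lower().startswith(("is", "does", "are")) else "open"

def plan(qtype, epsilon=0.2):
    # Planning module: exploit the agent with the best accumulated reward,
    # exploring occasionally so every agent keeps gathering experience.
    if random.random() < epsilon:
        return random.choice(list(AGENTS))
    return max(AGENTS, key=lambda a: experience[(qtype, a)]["reward"])

def judge(answer):
    # Judgment module stand-in: a real system would score the candidate
    # with an LLM critic or reward model; here we return a fixed score.
    return 0.9

def answer_question(question, max_iters=3, accept_threshold=0.8):
    qtype = question_type(question)
    best_answer, best_reward = None, float("-inf")
    for _ in range(max_iters):
        agent = plan(qtype)
        candidate = AGENTS[agent](question)
        reward = judge(candidate)
        # Accumulate experience so later planning calls improve.
        stats = experience[(qtype, agent)]
        stats["count"] += 1
        stats["reward"] += (reward - stats["reward"]) / stats["count"]
        if reward > best_reward:
            best_answer, best_reward = candidate, reward
        if reward >= accept_threshold:  # reward gating: stop iterating early
            break
    return best_answer

print(answer_question("Is there a fracture in the left femur?"))
```

The reward gate on the final `if` is what makes the framework "temporal": easy questions exit after one pass, while hard ones trigger further iterations, and every pass updates the shared experience table.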

System Components

Question-Agent Latent Space Module

Encodes questions and accumulates historical agent performance experience to build a dynamic representation that informs subsequent planning decisions

Planning Module

Selects the optimal reasoning agent or strategy from available options by leveraging the latent space representations and reward signals from previous iterations

Judgment and Inference Module

Evaluates the quality of generated answers via reward scoring, decides whether to accept or trigger further reasoning iterations, and produces the final answer
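One simple way to realize the experience accumulation these modules rely on is a running-average reward per question-type/agent pair. This is a deliberately reduced sketch (the class name, keys, and update rule are illustrative assumptions, not the paper's latent-space design):

```python
class ExperienceStore:
    """Running mean reward per (question_type, agent) pair."""

    def __init__(self):
        self.stats = {}  # (qtype, agent) -> (mean reward, sample count)

    def update(self, qtype, agent, reward):
        mean, n = self.stats.get((qtype, agent), (0.0, 0))
        n += 1
        mean += (reward - mean) / n  # incremental mean, no history kept
        self.stats[(qtype, agent)] = (mean, n)

    def best_agent(self, qtype, agents):
        # Prefer the agent with the highest accumulated mean reward;
        # unseen agents default to 0.0.
        return max(agents, key=lambda a: self.stats.get((qtype, a), (0.0, 0))[0])

store = ExperienceStore()
store.update("closed", "chain_of_thought", 0.9)
store.update("closed", "direct", 0.4)
print(store.best_agent("closed", ["chain_of_thought", "direct"]))  # chain_of_thought
```

The incremental-mean update keeps memory constant per pair, which matters if the store is consulted on every planning step across a long question stream.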

Results

Metric/Benchmark              | Baseline (Static Prompt)    | MedDAF (Dynamic)              | Delta
Zero-shot Accuracy (VQA-RAD)  | Competitive static baseline | Improved accuracy             | Positive gain reported
Zero-shot Accuracy (SLAKE)    | Competitive static baseline | Improved accuracy             | Positive gain reported
Reasoning Efficiency          | Fixed single-pass inference | Iterative with early stopping | Improved via reward gating

Key Takeaways

  • Decomposing complex VQA pipelines into explicit, separable modules (comprehension, planning, judgment) is a practical strategy for improving debuggability and targeted optimization in medical AI systems
  • Dynamic, reward-guided agent selection outperforms static prompting for domain-specific VQA, suggesting practitioners should consider question-adaptive reasoning strategies rather than one-size-fits-all prompts
  • Latent experience accumulation across questions enables a form of in-context meta-learning that can improve zero-shot performance without fine-tuning, a useful pattern for low-resource medical deployment scenarios

Abstract

Eliciting the reasoning capability of Large Language Models (LLMs) for medical Visual Question Answering (VQA) remains essential yet challenging. First, the entanglement of question comprehension, knowledge retrieval, and inference processes makes it difficult to isolate and analyze individual reasoning components. Second, addressing different medical questions requires various reasoning paradigms, while the traditional static prompt paradigm does not take full advantage of the diverse instructions that LLMs can provide. These limitations are compounded by current medical VQA benchmarks, which lack granular metrics to evaluate context-specific reasoning patterns beyond overall accuracy. To address these challenges, in this paper, we propose a reward-based Dynamic Agent Framework (MedDAF) to improve the reasoning in medical VQA. Our framework explicitly disentangles the reasoning pipeline into three components: (1) a question-agent latent space module for experience accumulation, (2) a planning module for optimal agent selection, and (3) a judgment and inference module with reward scoring and final answer generation. Through this iterative process, our framework forms a temporal dynamic agent framework to enhance accuracy and efficiency for medical VQA. Experiments on VQA-RAD and SLAKE have shown that our approach is effective in improving the accuracy and efficiency of zero-shot performance in medical VQA tasks. The implementation code is publicly available at https://github.com/Ziyan-Xiao/MedDAF/.

Generated on 2026-03-03 using Claude