A Dynamic Agent Framework for Large Language Model Reasoning in Medical Visual Question Answering
Problem Statement
Medical VQA requires simultaneous question comprehension, knowledge retrieval, and inference, making it hard to isolate and optimize individual reasoning steps. Static prompt paradigms fail to leverage the diversity of reasoning strategies LLMs can employ for different question types. Existing benchmarks also lack fine-grained metrics to evaluate context-specific reasoning patterns beyond overall accuracy.
Key Novelty
- Explicit disentanglement of the medical VQA reasoning pipeline into three distinct components: a question-agent latent space module, a planning module, and a judgment/inference module
- Reward-based dynamic agent selection that adaptively chooses the optimal reasoning paradigm per question, moving beyond static prompting
- Temporal iterative framework that accumulates experience in a latent space and refines agent selection over time for improved efficiency and accuracy
Evaluation Highlights
- Improved zero-shot accuracy on the VQA-RAD medical VQA benchmark compared to static prompt baselines
- Improved zero-shot accuracy on the SLAKE bilingual medical VQA benchmark, demonstrating cross-dataset generalization
Methodology
- Encode incoming medical VQA questions into a latent space that stores prior agent experiences and question-type representations, informing future agent selection
- Use a planning module to score and select the optimal reasoning agent/paradigm from a pool of candidates based on the question context and accumulated latent experience
- Execute the selected agent to generate a candidate answer, apply reward scoring in the judgment module, and iteratively refine until a final answer is produced
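The three-step loop above can be sketched in pure Python. Note that the agent pool, the reward function, the decay-based experience update, and the acceptance threshold below are illustrative stand-ins for this sketch, not the paper's actual implementation:

```python
import random

# Hypothetical pool of reasoning paradigms; placeholders for LLM-backed agents.
AGENTS = ["direct_answer", "chain_of_thought", "retrieve_then_answer"]

def plan(experience):
    """Planning module stand-in: pick the agent with the highest accumulated reward."""
    return max(AGENTS, key=lambda a: experience.get(a, 0.0))

def execute(agent, question):
    """Stand-in for running the chosen reasoning paradigm on an LLM."""
    return f"[{agent}] answer to: {question}"

def judge(answer):
    """Judgment module stand-in: a real system would score answer quality."""
    return random.random()

def answer_question(question, experience, threshold=0.6, max_iters=3):
    """Iterate plan -> execute -> judge until the reward gate accepts."""
    best, best_reward = None, -1.0
    for _ in range(max_iters):
        agent = plan(experience)
        candidate = execute(agent, question)
        reward = judge(candidate)
        # Accumulate experience so future selections are better informed.
        experience[agent] = 0.9 * experience.get(agent, 0.0) + 0.1 * reward
        if reward > best_reward:
            best, best_reward = candidate, reward
        if reward >= threshold:  # early stop once the answer is good enough
            break
    return best
```

The shared `experience` dict persists across calls, so selection quality can improve as more questions are answered.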
System Components
- Question-agent latent space module: encodes questions and accumulates historical agent-performance experience, building a dynamic representation that informs subsequent planning decisions
- Planning module: selects the optimal reasoning agent or strategy from the available pool by leveraging the latent-space representations and reward signals from previous iterations
- Judgment and inference module: evaluates the quality of generated answers via reward scoring, decides whether to accept the answer or trigger further reasoning iterations, and produces the final answer
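One plausible way the planning module's selection could work is a bandit-style scorer over accumulated per-question-type rewards, balancing exploitation of agents that worked before against exploration of untried ones. The function name, the `(total_reward, count)` bookkeeping, and the UCB1-style exploration bonus are assumptions of this sketch, not MedDAF's published algorithm:

```python
import math

def select_agent(qtype, agents, stats, c=1.0):
    """Score candidate agents for a question type and return the best.

    stats maps (qtype, agent) -> (total_reward, count), accumulated
    from earlier judgment-module rewards (hypothetical bookkeeping).
    """
    total = sum(stats.get((qtype, a), (0.0, 0))[1] for a in agents) + 1

    def score(agent):
        reward_sum, n = stats.get((qtype, agent), (0.0, 0))
        if n == 0:
            return float("inf")  # try every agent at least once
        # Mean reward plus UCB1 exploration bonus.
        return reward_sum / n + c * math.sqrt(math.log(total) / n)

    return max(agents, key=score)
```

With equal trial counts the agent with the best mean reward wins; untried agents are always explored first, which keeps the planner from locking in prematurely.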
Results
| Metric/Benchmark | Baseline (Static Prompt) | MedDAF (Dynamic) | Delta |
|---|---|---|---|
| Zero-shot Accuracy (VQA-RAD) | Competitive static baseline | Improved accuracy | Positive gain reported |
| Zero-shot Accuracy (SLAKE) | Competitive static baseline | Improved accuracy | Positive gain reported |
| Reasoning Efficiency | Fixed single-pass inference | Iterative with early stopping | Improved via reward gating |
Key Takeaways
- Decomposing complex VQA pipelines into explicit, separable modules (comprehension, planning, judgment) is a practical strategy for improving debuggability and targeted optimization in medical AI systems
- Dynamic, reward-guided agent selection outperforms static prompting for domain-specific VQA, suggesting practitioners should consider question-adaptive reasoning strategies rather than one-size-fits-all prompts
- Latent experience accumulation across questions enables a form of in-context meta-learning that can improve zero-shot performance without fine-tuning, a useful pattern for low-resource medical deployment scenarios
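As an illustration of the last point, a toy latent-space memory can be built with bag-of-words embeddings and cosine similarity: past (question, agent, reward) triples are stored, and a new question reuses the agent that succeeded on the most similar prior question. Everything here (the class name, the embedding, the similarity-times-reward weighting) is a deliberately simplified assumption; a real system would use learned embeddings:

```python
import math
from collections import Counter

def embed(question):
    """Toy embedding: bag-of-words token counts."""
    return Counter(question.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class ExperienceMemory:
    """Accumulates (embedding, agent, reward) triples across questions."""

    def __init__(self):
        self.entries = []

    def add(self, question, agent, reward):
        self.entries.append((embed(question), agent, reward))

    def suggest(self, question, default="chain_of_thought"):
        """Return the agent that worked best on the most similar past question."""
        q = embed(question)
        best, best_score = default, 0.0
        for emb, agent, reward in self.entries:
            score = cosine(q, emb) * reward  # weight similarity by past success
            if score > best_score:
                best, best_score = agent, score
        return best
```

No model weights change between questions: the memory alone steers agent selection, which is why this pattern suits zero-shot, low-resource deployment.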
Abstract
Eliciting the reasoning capability of Large Language Models (LLMs) for medical Visual Question Answering (VQA) remains essential yet challenging. First, the entanglement of question comprehension, knowledge retrieval, and inference makes it difficult to isolate and analyze individual reasoning components. Second, different medical questions call for different reasoning paradigms, while the traditional static prompt paradigm fails to exploit the diverse reasoning strategies LLMs can follow. These limitations are compounded by current medical VQA benchmarks, which lack granular metrics to evaluate context-specific reasoning patterns beyond overall accuracy. To address these challenges, we propose a reward-based Dynamic Agent Framework (MedDAF) to improve reasoning in medical VQA. Our framework explicitly disentangles the reasoning pipeline into three components: (1) a question-agent latent space module for experience accumulation, (2) a planning module for optimal agent selection, and (3) a judgment and inference module with reward scoring and final answer generation. Through this iterative process, MedDAF operates as a temporal dynamic agent framework that enhances both accuracy and efficiency for medical VQA. Experiments on VQA-RAD and SLAKE show that our approach improves the accuracy and efficiency of zero-shot performance on medical VQA tasks. The implementation code is publicly available at https://github.com/Ziyan-Xiao/MedDAF/.