A Survey of Large Language Model Agents for Question Answering
Problem Statement
Traditional QA agents require substantial labeled data and struggle to generalize to new domains or environments, limiting their practical utility. Naive LLM QA systems cannot interact with external knowledge sources, which constrains their accuracy and factual grounding. The field has also lacked a unified survey organizing the rapidly growing space of LLM agent QA systems, making it difficult for researchers to understand the landscape and identify open challenges.
Key Novelty
- Systematic taxonomy of LLM agent QA systems organized across four key functional stages: planning, question understanding, information retrieval, and answer generation (a minimal sketch of this taxonomy follows this list)
- Comparative framing that distinguishes LLM-based agents from both traditional QA pipelines and naive LLM QA approaches, clarifying when and why agents provide superior performance
- Identification of ongoing challenges and future research directions specific to LLM agent QA, providing a roadmap for the research community
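To make the taxonomy concrete, here is a minimal sketch that encodes the four stages as a lookup table. The stage names come from the survey; the technique lists mirror the Methodology section below and are illustrative rather than exhaustive, and the function name is hypothetical.

```python
# Minimal sketch of the survey's four-stage taxonomy as a lookup table.
# Technique lists are illustrative, drawn from the Methodology section below.
LLM_AGENT_QA_TAXONOMY = {
    "planning": ["task decomposition", "tool selection"],
    "question understanding": ["intent parsing", "disambiguation"],
    "information retrieval": ["tool use", "RAG", "web search"],
    "answer generation": ["synthesis", "grounding"],
}


def categorize(system_techniques):
    """Group a surveyed system's reported techniques under the four stages."""
    return {
        stage: sorted(set(techniques) & set(system_techniques))
        for stage, techniques in LLM_AGENT_QA_TAXONOMY.items()
    }
```

For example, `categorize({"RAG", "task decomposition"})` would file such a system under the planning and information retrieval stages.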
Evaluation Highlights
- Qualitative comparison showing that, by interacting with external environments, LLM-based agents achieve superior QA results to both traditional pipelines and standalone LLM systems
- Survey coverage spans key design dimensions (planning, question understanding, retrieval, generation), providing comprehensive landscape analysis rather than single benchmark metrics
Breakthrough Assessment
Methodology
- Identify and categorize existing LLM agent QA systems by contrasting them with traditional agents and naive LLM QA baselines to establish the motivation for the agent paradigm
- Systematically review agent designs across four stages: planning (task decomposition, tool selection), question understanding (intent parsing, disambiguation), information retrieval (tool use, RAG, web search), and answer generation (synthesis, grounding); a minimal end-to-end sketch follows this list
- Analyze open challenges such as hallucination, multi-hop reasoning, tool reliability, and efficiency, then derive future research directions for the community
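To ground the four-stage review, the following is a minimal end-to-end sketch of how such an agent could be wired together. It assumes two hypothetical callables that are not part of the survey: `llm(prompt)` for the core model and `search(query)` for external retrieval; the prompts are placeholders, and the structure rather than any specific API is the point.

```python
def answer_question(question, llm, search):
    """Minimal four-stage QA agent loop: understand, plan, retrieve, generate."""
    # Question understanding: parse intent and reformulate ambiguous phrasing.
    clarified = llm(f"Rewrite this question so it is unambiguous: {question}")

    # Planning: decompose the task into sub-questions, one per line.
    plan = llm(f"List the sub-questions needed to answer: {clarified}")
    sub_questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Information retrieval: gather external evidence for each sub-question.
    evidence = []
    for sub_question in sub_questions:
        evidence.extend(search(sub_question))

    # Answer generation: synthesize a grounded answer from the evidence.
    context = "\n".join(evidence)
    return llm(f"Using only this evidence:\n{context}\n\nAnswer: {clarified}")
```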
System Components
- Planning: Handles task decomposition, sub-question generation, and tool/strategy selection to guide the agent's overall problem-solving trajectory
- Question understanding: Parses user intent, resolves ambiguity, and reformulates complex or multi-part questions into forms amenable to downstream retrieval and reasoning
- Information retrieval: Interfaces with external environments, including knowledge bases, search engines, APIs, and documents, to fetch relevant evidence beyond the LLM's parametric knowledge
- Answer generation: Synthesizes retrieved evidence and intermediate reasoning steps into a final, grounded answer, handling aggregation across multiple sources or reasoning hops
- Core LLM: Serves as the central reasoning engine that coordinates all other components, leveraging in-context learning and instruction-following to flexibly adapt to diverse QA tasks
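Viewed as an architecture, these components can be captured as narrow interfaces coordinated by a central agent. The sketch below uses Python Protocols; the class and method names are hypothetical, chosen only to mirror the roles above, and in the surveyed systems the same core LLM typically backs several of these roles.

```python
from typing import Protocol


class QuestionUnderstander(Protocol):
    def clarify(self, question: str) -> str: ...  # intent parsing, disambiguation


class Planner(Protocol):
    def plan(self, question: str) -> list[str]: ...  # sub-questions, tool choices


class Retriever(Protocol):
    def retrieve(self, query: str) -> list[str]: ...  # KBs, search, APIs, documents


class AnswerGenerator(Protocol):
    def generate(self, question: str, evidence: list[str]) -> str: ...  # grounded synthesis


class QAAgent:
    """Thin coordinator wiring the four LLM-backed components together."""

    def __init__(self, understander: QuestionUnderstander, planner: Planner,
                 retriever: Retriever, generator: AnswerGenerator) -> None:
        self.understander = understander
        self.planner = planner
        self.retriever = retriever
        self.generator = generator

    def answer(self, question: str) -> str:
        clarified = self.understander.clarify(question)
        evidence: list[str] = []
        for sub_question in self.planner.plan(clarified):
            evidence.extend(self.retriever.retrieve(sub_question))
        return self.generator.generate(clarified, evidence)
```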
Results
| Aspect | Traditional QA Agents | Naive LLM QA | LLM Agent QA |
|---|---|---|---|
| Data Requirements | High (task-specific labeled data) | Low (zero/few-shot) | Low (zero/few-shot) |
| Generalization | Limited to trained domains | Moderate (parametric knowledge) | High (external env. interaction) |
| External Knowledge Access | Structured, limited | None (closed-book) | Dynamic via tools/retrieval |
| Multi-hop Reasoning | Difficult | Moderate | Strong (via planning) |
| Overall QA Performance | Baseline | Moderate improvement | Superior (per surveyed works) |
Key Takeaways
- When building production QA systems, structuring your LLM agent pipeline around the four stages (planning, question understanding, retrieval, generation) provides a principled design checklist that maps to the current research frontier
- Retrieval-augmented and tool-using LLM agents consistently outperform closed-book LLMs on knowledge-intensive QA, making external environment integration a near-mandatory design choice for high-accuracy applications (see the contrast sketched after this list)
- Key open challenges — multi-hop reasoning reliability, tool selection errors, hallucination in synthesis, and computational efficiency — represent the highest-leverage areas for research investment and engineering mitigation in deployed LLM QA systems
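As a small illustration of why external environment integration matters, the sketch below contrasts a closed-book call with a retrieval-augmented one; `llm` and `search` are the same hypothetical stand-ins used in the earlier sketches, not APIs from the survey.

```python
def closed_book_answer(question, llm):
    """Relies entirely on the model's parametric knowledge (no retrieval)."""
    return llm(f"Answer from memory: {question}")


def retrieval_augmented_answer(question, llm, search):
    """Grounds the answer in externally retrieved evidence before synthesis."""
    passages = search(question)
    context = "\n".join(passages)
    return llm(
        f"Evidence:\n{context}\n\n"
        f"Answer the question using only the evidence above: {question}"
    )
```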
Abstract
This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.