Retrieval augmented generation for large language models in healthcare: A systematic review
Problem Statement
LLMs in healthcare suffer from outdated training data, hallucination risks, and lack of transparency — all critical issues in clinical settings. RAG offers a promising mitigation strategy, but there is no systematic understanding of which RAG approaches, datasets, and evaluation frameworks exist in healthcare. This gap hinders responsible and evidence-based adoption of RAG-based systems in medical practice.
Key Novelty
- First systematic review specifically mapping RAG methodologies (Naive, Advanced, Modular) as applied to healthcare LLMs, covering retrieval, augmentation, and generation stages
- Comprehensive cataloging of healthcare datasets used for RAG, revealing strong English-language dominance (78.9%) and significant Chinese representation (21.1%), exposing multilingual gaps
- Identification of the absence of standardized RAG evaluation frameworks in healthcare and the widespread neglect of ethical considerations in existing studies
Evaluation Highlights
- 78.9% of studies used English-language datasets; 21.1% used Chinese datasets, indicating severe underrepresentation of other languages
- Proprietary models (GPT-3.5/4) dominate RAG healthcare applications, with no standardized evaluation benchmark identified across studies
Methodology
- Systematic literature search and selection of studies applying RAG-based LLMs in healthcare, following structured review protocols (PRISMA or equivalent)
- Taxonomic analysis of RAG pipeline components across three dimensions (retrieval strategies, augmentation techniques, and generation approaches), classified under the Naive, Advanced, and Modular RAG paradigms
- Synthesis of findings across dataset characteristics, model choices, evaluation frameworks, and ethical considerations to identify patterns, strengths, and critical gaps
System Components
- Retrieval: Examines how external knowledge sources (medical databases, EHRs, literature) are indexed and queried to ground LLM responses
- Augmentation: Reviews how retrieved context is integrated into LLM prompts, including chunking, reranking, and context fusion strategies
- Generation: Assesses how LLMs produce final outputs conditioned on retrieved context, covering model choices and output quality
- RAG Paradigms: Classifies approaches into Naive RAG (simple retrieve-then-generate), Advanced RAG (iterative retrieval/reranking), and Modular RAG (pipeline-flexible architectures)
- Evaluation: Surveys metrics and benchmarks used across studies, finding a lack of standardization in how RAG performance is measured in healthcare
- Ethics: Evaluates the extent to which studies address ethical challenges such as bias, privacy, and clinical safety, finding widespread neglect
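The retrieval, augmentation, and generation stages surveyed above can be illustrated with a minimal Naive RAG sketch. This is a toy illustration, not a method from the review: the bag-of-words retriever, the `chunk`/`retrieve`/`build_prompt` helpers, and the medical snippet are all hypothetical, and the final prompt would be sent to an LLM (the review found proprietary GPT-3.5/4 to be the most common choice), which is omitted here.

```python
# Toy Naive RAG (retrieve-then-generate) sketch: chunking, bag-of-words
# retrieval, and prompt augmentation. All names and data are illustrative.
import math
from collections import Counter

def chunk(text, size=8):
    """Split a document into fixed-size word chunks (simplest chunking strategy)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, chunks, k=2):
    """Retrieval step: rank chunks against the query, return top-k as grounding context."""
    q = Counter(query.lower().split())
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    """Augmentation step: fuse retrieved context into the LLM prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {query}"

corpus = chunk(
    "Metformin is a first-line therapy for type 2 diabetes. It lowers hepatic "
    "glucose production. Common side effects include gastrointestinal upset "
    "and, rarely, lactic acidosis."
)
context = retrieve("What are the side effects of metformin?", corpus)
prompt = build_prompt("What are the side effects of metformin?", context)
# `prompt` would then go to the generation step (an LLM call, not shown).
```

Advanced RAG would add a reranking or iterative-retrieval pass between `retrieve` and `build_prompt`; Modular RAG would let each of these stages be swapped independently.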
Results
| Dimension | Prior State | Review Finding | Gap/Delta |
|---|---|---|---|
| Dataset Language Coverage | Assumed diverse | 78.9% English, 21.1% Chinese | Severe underrepresentation of non-English/non-Chinese languages |
| Dominant LLM Type | Unknown/mixed | Proprietary (GPT-3.5/4) most used | Open-source models underexplored for healthcare RAG |
| Evaluation Standardization | Assumed emerging | No standardized framework identified | Critical gap for reproducibility and clinical validation |
| Ethical Considerations | Assumed present | Majority of studies do not address ethics | Significant oversight for safety-critical clinical deployment |
Key Takeaways
- Practitioners building healthcare RAG systems should not rely on GPT-3.5/4 by default — the dominance of proprietary models reflects convenience over clinical suitability, and open-source alternatives warrant evaluation for privacy-sensitive deployments
- There is no community consensus on how to evaluate RAG in healthcare; teams should proactively define evaluation frameworks covering factuality, retrieval precision, clinical safety, and hallucination rate rather than borrowing generic NLP benchmarks
- Ethical and multilingual considerations are largely absent from current RAG healthcare research — responsible deployment requires explicit bias auditing, patient data privacy safeguards, and expansion beyond English/Chinese clinical corpora
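Since the review found no standardized evaluation framework, teams defining their own could start from simple retrieval and grounding metrics. The sketch below is a hypothetical starting point, not a framework from the literature: `precision_at_k` assumes gold-labelled relevant chunks per query, and `grounded_fraction` is only a crude lexical proxy for faithfulness; real hallucination and clinical-safety assessment requires expert review.

```python
# Illustrative (not standardized) RAG evaluation helpers: retrieval
# precision@k plus a crude lexical-overlap grounding check.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are truly relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k if k else 0.0

def grounded_fraction(answer_sentences, context):
    """Share of answer sentences whose content words all appear in the
    retrieved context. A stand-in for real faithfulness/hallucination
    metrics, which need clinical validation."""
    ctx = context.lower()
    def grounded(sentence):
        words = [w.strip(".,;:") for w in sentence.lower().split()]
        content = [w for w in words if len(w) > 3]
        return all(w in ctx for w in content)
    hits = sum(grounded(s) for s in answer_sentences)
    return hits / len(answer_sentences) if answer_sentences else 0.0

# Hypothetical example: chunk IDs and a gold relevance set.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
p_at_2 = precision_at_k(retrieved, relevant, 2)  # c7 miss, c2 hit -> 0.5

context_text = "metformin commonly causes gastrointestinal upset"
answer = ["Metformin causes gastrointestinal upset.", "It also cures cancer."]
gf = grounded_fraction(answer, context_text)  # second sentence ungrounded -> 0.5
```

A fuller framework would add clinical-safety review, citation accuracy, and multilingual test sets, per the gaps identified above.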
Abstract
Large Language Models (LLMs) have demonstrated promising capabilities to solve complex tasks in critical sectors such as healthcare. However, LLMs are limited by their training data, which is often outdated, by a tendency to generate inaccurate ("hallucinated") content, and by a lack of transparency in the content they generate. To address these limitations, retrieval augmented generation (RAG) grounds the responses of LLMs by exposing them to external knowledge sources. However, in the healthcare domain there is currently a lack of systematic understanding of which datasets, RAG methodologies, and evaluation frameworks are available. This review aims to bridge this gap by assessing RAG-based approaches employed by LLMs in healthcare, focusing on the steps of retrieval, augmentation, and generation. Additionally, we identify the limitations, strengths, and gaps in the existing literature. Our synthesis shows that 78.9% of studies used English datasets and 21.1% used Chinese datasets. We find that a range of techniques is employed by RAG-based LLMs in healthcare, including Naive RAG, Advanced RAG, and Modular RAG. Surprisingly, proprietary models such as GPT-3.5/4 are the most used for RAG applications in healthcare. We find that there is a lack of standardized evaluation frameworks for RAG-based applications. In addition, the majority of the studies do not assess or address ethical considerations related to RAG in healthcare. It is important to account for the ethical challenges that are inherent when AI systems are implemented in the clinical setting. Lastly, we highlight the need for further research and development to ensure responsible and effective adoption of RAG in the medical domain.