Retrieval augmented generation for large language models in healthcare: A systematic review
Problem Statement
LLMs in healthcare suffer from outdated training data, hallucination risks, and lack of transparency — all critical issues in clinical settings. RAG offers a promising mitigation strategy, but there is no systematic understanding of which RAG approaches, datasets, and evaluation frameworks exist in healthcare. This gap hinders responsible and evidence-based adoption of RAG-based systems in medical practice.
Key Novelty
- First systematic review specifically mapping RAG methodologies (Naive, Advanced, Modular) as applied to healthcare LLMs, covering retrieval, augmentation, and generation stages
- Comprehensive cataloging of healthcare datasets used for RAG, revealing strong English-language dominance (78.9%) and significant Chinese representation (21.1%), exposing multilingual gaps
- Identification of the absence of standardized RAG evaluation frameworks in healthcare and the widespread neglect of ethical considerations in existing studies
Evaluation Highlights
- 78.9% of studies used English-language datasets; 21.1% used Chinese datasets, indicating severe underrepresentation of other languages
- Proprietary models (GPT-3.5/4) dominate RAG healthcare applications, with no standardized evaluation benchmark identified across studies
Methodology
- Systematic literature search and selection of studies applying RAG-based LLMs in healthcare, following structured review protocols (PRISMA or equivalent)
- Taxonomic analysis of RAG pipeline components across three dimensions (retrieval strategies, augmentation techniques, and generation approaches), classified under the Naive, Advanced, and Modular RAG paradigms
- Synthesis of findings across dataset characteristics, model choices, evaluation frameworks, and ethical considerations to identify patterns, strengths, and critical gaps
System Components
- Retrieval: Examines how external knowledge sources (medical databases, EHRs, literature) are indexed and queried to ground LLM responses
- Augmentation: Reviews how retrieved context is integrated into LLM prompts, including chunking, reranking, and context fusion strategies
- Generation: Assesses how LLMs produce final outputs conditioned on retrieved context, covering model choices and output quality
- RAG Paradigms: Classifies approaches into Naive RAG (simple retrieve-then-generate), Advanced RAG (iterative retrieval/reranking), and Modular RAG (pipeline-flexible architectures)
- Evaluation: Surveys metrics and benchmarks used across studies, finding a lack of standardization in how RAG performance is measured in healthcare
- Ethics: Evaluates the extent to which studies address ethical challenges such as bias, privacy, and clinical safety, finding widespread neglect
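The retrieval, augmentation, and generation stages surveyed above can be illustrated with a minimal Naive RAG sketch. This is a toy illustration, not a method from the review: the bag-of-words retriever, the `chunk`/`retrieve`/`build_prompt` helpers, and the medical snippet are all hypothetical, and the final prompt would be sent to an LLM (the review found proprietary GPT-3.5/4 to be the most common choice), which is omitted here.

```python
# Toy Naive RAG (retrieve-then-generate) sketch: chunking, bag-of-words
# retrieval, and prompt augmentation. All names and data are illustrative.
import math
from collections import Counter

def chunk(text, size=8):
    """Split a document into fixed-size word chunks (simplest chunking strategy)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, chunks, k=2):
    """Retrieval step: rank chunks against the query, return top-k as grounding context."""
    q = Counter(query.lower().split())
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    """Augmentation step: fuse retrieved context into the LLM prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {query}"

corpus = chunk(
    "Metformin is a first-line therapy for type 2 diabetes. It lowers hepatic "
    "glucose production. Common side effects include gastrointestinal upset "
    "and, rarely, lactic acidosis."
)
context = retrieve("What are the side effects of metformin?", corpus)
prompt = build_prompt("What are the side effects of metformin?", context)
# `prompt` would then go to the generation step (an LLM call, not shown).
```

Advanced RAG would add a reranking or iterative-retrieval pass between `retrieve` and `build_prompt`; Modular RAG would let each of these stages be swapped independently.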
Results
| Dimension | Prior State | Review Finding | Gap/Delta |
|---|---|---|---|
| Dataset Language Coverage | Assumed diverse | 78.9% English, 21.1% Chinese | Severe underrepresentation of non-English/non-Chinese languages |
| Dominant LLM Type | Unknown/mixed | Proprietary (GPT-3.5/4) most used | Open-source models underexplored for healthcare RAG |
| Evaluation Standardization | Assumed emerging | No standardized framework identified | Critical gap for reproducibility and clinical validation |
| Ethical Considerations | Assumed present | Majority of studies do not address ethics | Significant oversight for safety-critical clinical deployment |
Key Takeaways
- Practitioners building healthcare RAG systems should not rely on GPT-3.5/4 by default — the dominance of proprietary models reflects convenience over clinical suitability, and open-source alternatives warrant evaluation for privacy-sensitive deployments
- There is no community consensus on how to evaluate RAG in healthcare; teams should proactively define evaluation frameworks covering factuality, retrieval precision, clinical safety, and hallucination rate rather than borrowing generic NLP benchmarks
- Ethical and multilingual considerations are largely absent from current RAG healthcare research — responsible deployment requires explicit bias auditing, patient data privacy safeguards, and expansion beyond English/Chinese clinical corpora
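Since the review found no standardized evaluation framework, teams defining their own could start from simple retrieval and grounding metrics. The sketch below is a hypothetical starting point, not a framework from the literature: `precision_at_k` assumes gold-labelled relevant chunks per query, and `grounded_fraction` is only a crude lexical proxy for faithfulness; real hallucination and clinical-safety assessment requires expert review.

```python
# Illustrative (not standardized) RAG evaluation helpers: retrieval
# precision@k plus a crude lexical-overlap grounding check.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are truly relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k if k else 0.0

def grounded_fraction(answer_sentences, context):
    """Share of answer sentences whose content words all appear in the
    retrieved context. A stand-in for real faithfulness/hallucination
    metrics, which need clinical validation."""
    ctx = context.lower()
    def grounded(sentence):
        words = [w.strip(".,;:") for w in sentence.lower().split()]
        content = [w for w in words if len(w) > 3]
        return all(w in ctx for w in content)
    hits = sum(grounded(s) for s in answer_sentences)
    return hits / len(answer_sentences) if answer_sentences else 0.0

# Hypothetical example: chunk IDs and a gold relevance set.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
p_at_2 = precision_at_k(retrieved, relevant, 2)  # c7 miss, c2 hit -> 0.5

context_text = "metformin commonly causes gastrointestinal upset"
answer = ["Metformin causes gastrointestinal upset.", "It also cures cancer."]
gf = grounded_fraction(answer, context_text)  # second sentence ungrounded -> 0.5
```

A fuller framework would add clinical-safety review, citation accuracy, and multilingual test sets, per the gaps identified above.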
Abstract
Large Language Models (LLMs) have demonstrated promising capabilities to solve complex tasks in critical sectors such as healthcare. However, LLMs are limited by their training data, which is often outdated, by a tendency to generate inaccurate ("hallucinated") content, and by a lack of transparency in the content they generate. To address these limitations, retrieval augmented generation (RAG) grounds the responses of LLMs by exposing them to external knowledge sources. However, in the healthcare domain there is currently a lack of systematic understanding of which datasets, RAG methodologies, and evaluation frameworks are available. This review aims to bridge this gap by assessing RAG-based approaches employed by LLMs in healthcare, focusing on the steps of retrieval, augmentation, and generation. Additionally, we identify the limitations, strengths, and gaps in the existing literature. Our synthesis shows that 78.9% of studies used English datasets and 21.1% used Chinese datasets. We find that a range of techniques is employed by RAG-based LLMs in healthcare, including Naive RAG, Advanced RAG, and Modular RAG. Surprisingly, proprietary models such as GPT-3.5/4 are the most used for RAG applications in healthcare. We find that there is a lack of standardized evaluation frameworks for RAG-based applications. In addition, the majority of the studies do not assess or address ethical considerations related to RAG in healthcare. It is important to account for the ethical challenges that are inherent when AI systems are implemented in the clinical setting. Lastly, we highlight the need for further research and development to ensure responsible and effective adoption of RAG in the medical domain.