Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness
Problem Statement
LLMs lack domain-specific medical expertise out-of-the-box, making them unreliable for clinical decision support without customization. Preoperative assessments require precise, guideline-adherent responses where errors can directly harm patients. Existing deployments risk hallucinations and inconsistency, limiting clinical trust and adoption.
Key Novelty
- Comprehensive head-to-head evaluation of 10 diverse LLMs (including GPT-3.5, GPT-4, GPT-4o, Gemini, Llama2, Llama3, and Claude) under a unified RAG framework for a high-stakes medical task
- Benchmarking LLM-RAG outputs against 448 human-generated clinical responses across 14 real-world clinical scenarios using both local and international guidelines
- Demonstration that RAG can eliminate hallucinations in a guideline-grounded medical task while achieving accuracy and consistency surpassing human experts
Evaluation Highlights
- GPT-4 LLM-RAG with international guidelines achieved 96.4% accuracy vs. 86.6% for human-generated responses (p=0.016), with zero hallucinations
- Responses generated within 20 seconds, with higher output consistency than human experts across 3,234 total generated responses
Methodology
- Curate a RAG knowledge base from 35 local and 23 international preoperative/surgical fitness guidelines, then index them for retrieval
- Design 14 clinical scenarios covering diverse preoperative assessment cases and generate queries; retrieve relevant guideline chunks and augment LLM prompts accordingly
- Generate responses from 10 LLMs under the RAG framework, then evaluate 3,234 responses for accuracy, consistency, hallucination rate, and latency against 448 clinician-generated gold-standard answers
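The retrieve-and-augment step described above can be sketched in outline. This is a minimal illustration under invented names (`retrieve`, `build_prompt`, the keyword-overlap scorer), not the study's implementation; a real system would use an embedding-based retriever over the indexed guidelines.

```python
import re

def _tokens(text):
    """Lowercase word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, guideline_chunks, k=3):
    """Rank guideline chunks by crude keyword overlap and keep the top k."""
    ranked = sorted(guideline_chunks,
                    key=lambda ch: len(_tokens(query) & _tokens(ch)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, guideline_chunks, k=3):
    """Augment the clinical query with the retrieved guideline context."""
    context = "\n".join(retrieve(query, guideline_chunks, k))
    return ("Answer strictly from the guideline excerpts below.\n"
            f"Guidelines:\n{context}\n\nClinical question: {query}")

# Toy guideline chunks for illustration only
chunks = [
    "Continue metformin until the day of surgery.",
    "Stop aspirin seven days before elective intracranial procedures.",
    "Fasting: no solids six hours before anaesthesia, clear fluids up to two hours.",
]
prompt = build_prompt("What are the fasting rules before anaesthesia?", chunks, k=1)
```

The augmented prompt is then sent unchanged to each of the ten LLMs, so every model answers from the same retrieved guideline context.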
System Components
- Guideline knowledge base: indexed repository of 35 local and 23 international surgical fitness and preoperative care guidelines used for context retrieval
- Retriever: fetches relevant guideline passages for each clinical query to ground LLM responses in authoritative medical knowledge
- LLM pool: ten LLMs (including GPT-3.5, GPT-4, GPT-4o, Gemini, Llama2, Llama3, and Claude) that generate responses augmented with the retrieved context
- Evaluation benchmark: 14 clinical scenarios with 448 human-generated reference answers used to benchmark accuracy, consistency, hallucination rate, and latency
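A hallucination screen over grounded outputs can be approximated by checking whether each generated sentence is supported by the retrieved context. The sketch below is a crude lexical heuristic under invented names; it is not the study's evaluation method, which benchmarked model outputs against clinician-written reference answers.

```python
import re

# Illustrative grounding heuristic, not the study's evaluation procedure.
STOPWORDS = {"the", "a", "an", "of", "to", "until", "day", "is", "for"}

def is_grounded(sentence, context, threshold=0.5):
    """True if at least `threshold` of the sentence's content words
    also appear in the retrieved guideline context."""
    words = set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS
    ctx = set(re.findall(r"[a-z]+", context.lower()))
    if not words:
        return True
    return len(words & ctx) / len(words) >= threshold

ctx = "Continue metformin until the day of surgery."
```

A sentence restating the guideline passes the check, while a claim with no lexical support in the retrieved context is flagged as potentially hallucinated.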
Results
| Metric | Human Baseline | GPT-4 LLM-RAG | Delta |
|---|---|---|---|
| Accuracy | 86.6% | 96.4% | +9.8 pp (p=0.016) |
| Hallucination Rate | Not reported | 0% (absent) | Eliminated |
| Response Consistency | Lower | Higher | Qualitative improvement |
| Response Latency | Not reported | <20 seconds | Substantially faster |
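To make the table's arithmetic concrete, the sketch below recomputes the two accuracies from hypothetical correct/total counts (108/112 and 97/112 are invented solely to reproduce 96.4% and 86.6%) and applies a standard pooled two-proportion z-test. This is not necessarily the test the authors used, so the resulting p-value is illustrative only.

```python
import math

def two_proportion_p(c1, n1, c2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

# Hypothetical counts chosen only to reproduce the reported percentages.
rag_correct, human_correct, n = 108, 97, 112
rag_acc = round(100 * rag_correct / n, 1)     # 96.4
human_acc = round(100 * human_correct / n, 1) # 86.6
p = two_proportion_p(rag_correct, n, human_correct, n)
```

Under these invented counts the gap is significant at the 5% level, consistent in direction with the reported p=0.016.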
Key Takeaways
- RAG is a viable and effective strategy for deploying LLMs in high-stakes medical decision support, grounding outputs in authoritative guidelines and eliminating hallucinations
- GPT-4 consistently outperforms other tested LLMs under RAG for clinical accuracy, making model selection critical when building medical AI pipelines
- When evaluating LLM systems for clinical use, benchmarking against human expert performance—not just inter-model comparison—is essential to establish real-world utility and safety
Abstract
Large Language Models (LLMs) hold promise for medical applications but often lack domain-specific expertise. Retrieval Augmented Generation (RAG) enables customization by integrating specialized knowledge. This study assessed the accuracy, consistency, and safety of LLM-RAG models in determining surgical fitness and delivering preoperative instructions using 35 local and 23 international guidelines. Ten LLMs (including GPT-3.5, GPT-4, GPT-4o, Gemini, Llama2, Llama3, and Claude) were tested across 14 clinical scenarios. A total of 3,234 responses were generated and compared to 448 human-generated answers. The GPT-4 LLM-RAG model with international guidelines generated answers within 20 seconds and achieved the highest accuracy, significantly better than human-generated responses (96.4% vs. 86.6%, p = 0.016). Additionally, the model exhibited no hallucinations and produced more consistent output than humans. This study underscores the potential of GPT-4-based LLM-RAG models to deliver highly accurate, efficient, and consistent preoperative assessments.