Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness
Problem Statement
LLMs lack domain-specific medical expertise out-of-the-box, making them unreliable for clinical decision support without customization. Preoperative assessments require precise, guideline-adherent responses where errors can directly harm patients. Existing deployments risk hallucinations and inconsistency, limiting clinical trust and adoption.
Key Novelty
- Comprehensive head-to-head evaluation of 10 diverse LLMs (including GPT-3.5, GPT-4, GPT-4o, Gemini, Llama2, Llama3, and Claude) under a unified RAG framework for a high-stakes medical task
- Benchmarking LLM-RAG outputs against 448 human-generated clinical responses across 14 real-world clinical scenarios using both local and international guidelines
- Demonstration that RAG can eliminate hallucinations in a guideline-grounded medical task while achieving accuracy and consistency surpassing human experts
Evaluation Highlights
- GPT-4 LLM-RAG with international guidelines achieved 96.4% accuracy vs. 86.6% for human-generated responses (p=0.016), with zero hallucinations
- Responses generated within 20 seconds, with higher output consistency than human experts across 3,234 total generated responses
Methodology
- Curate a RAG knowledge base from 35 local and 23 international preoperative/surgical fitness guidelines, then index them for retrieval
- Design 14 clinical scenarios covering diverse preoperative assessment cases and generate queries; retrieve relevant guideline chunks and augment LLM prompts accordingly
- Generate responses from 10 LLMs under the RAG framework, then evaluate 3,234 responses for accuracy, consistency, hallucination rate, and latency against 448 clinician-generated gold-standard answers
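The retrieve-and-augment step described above can be sketched in outline. This is a minimal illustration under invented names (`retrieve`, `build_prompt`, the keyword-overlap scorer), not the study's implementation; a real system would use an embedding-based retriever over the indexed guidelines.

```python
import re

def _tokens(text):
    """Lowercase word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, guideline_chunks, k=3):
    """Rank guideline chunks by crude keyword overlap and keep the top k."""
    ranked = sorted(guideline_chunks,
                    key=lambda ch: len(_tokens(query) & _tokens(ch)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, guideline_chunks, k=3):
    """Augment the clinical query with the retrieved guideline context."""
    context = "\n".join(retrieve(query, guideline_chunks, k))
    return ("Answer strictly from the guideline excerpts below.\n"
            f"Guidelines:\n{context}\n\nClinical question: {query}")

# Toy guideline chunks for illustration only
chunks = [
    "Continue metformin until the day of surgery.",
    "Stop aspirin seven days before elective intracranial procedures.",
    "Fasting: no solids six hours before anaesthesia, clear fluids up to two hours.",
]
prompt = build_prompt("What are the fasting rules before anaesthesia?", chunks, k=1)
```

The augmented prompt is then sent unchanged to each of the ten LLMs, so every model answers from the same retrieved guideline context.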
System Components
- Guideline knowledge base: indexed repository of 35 local and 23 international surgical fitness and preoperative care guidelines used for context retrieval
- Retriever: fetches relevant guideline passages for each clinical query to ground LLM responses in authoritative medical knowledge
- LLM pool: ten LLMs (including GPT-3.5, GPT-4, GPT-4o, Gemini, Llama2, Llama3, and Claude) that generate responses augmented with the retrieved context
- Evaluation benchmark: 14 clinical scenarios with 448 human-generated reference answers used to benchmark accuracy, consistency, hallucination rate, and latency
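A hallucination screen over grounded outputs can be approximated by checking whether each generated sentence is supported by the retrieved context. The sketch below is a crude lexical heuristic under invented names; it is not the study's evaluation method, which benchmarked model outputs against clinician-written reference answers.

```python
import re

# Illustrative grounding heuristic, not the study's evaluation procedure.
STOPWORDS = {"the", "a", "an", "of", "to", "until", "day", "is", "for"}

def is_grounded(sentence, context, threshold=0.5):
    """True if at least `threshold` of the sentence's content words
    also appear in the retrieved guideline context."""
    words = set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS
    ctx = set(re.findall(r"[a-z]+", context.lower()))
    if not words:
        return True
    return len(words & ctx) / len(words) >= threshold

ctx = "Continue metformin until the day of surgery."
```

A sentence restating the guideline passes the check, while a claim with no lexical support in the retrieved context is flagged as potentially hallucinated.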
Results
| Metric | Human Baseline | GPT-4 LLM-RAG | Delta |
|---|---|---|---|
| Accuracy | 86.6% | 96.4% | +9.8 pp (p=0.016) |
| Hallucination Rate | Not reported | 0% (absent) | Eliminated |
| Response Consistency | Lower | Higher | Qualitative improvement |
| Response Latency | Not reported | <20 seconds | Substantially faster |
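To make the table's arithmetic concrete, the sketch below recomputes the two accuracies from hypothetical correct/total counts (108/112 and 97/112 are invented solely to reproduce 96.4% and 86.6%) and applies a standard pooled two-proportion z-test. This is not necessarily the test the authors used, so the resulting p-value is illustrative only.

```python
import math

def two_proportion_p(c1, n1, c2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

# Hypothetical counts chosen only to reproduce the reported percentages.
rag_correct, human_correct, n = 108, 97, 112
rag_acc = round(100 * rag_correct / n, 1)     # 96.4
human_acc = round(100 * human_correct / n, 1) # 86.6
p = two_proportion_p(rag_correct, n, human_correct, n)
```

Under these invented counts the gap is significant at the 5% level, consistent in direction with the reported p=0.016.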
Key Takeaways
- RAG is a viable and effective strategy for deploying LLMs in high-stakes medical decision support, grounding outputs in authoritative guidelines and eliminating hallucinations
- GPT-4 consistently outperforms other tested LLMs under RAG for clinical accuracy, making model selection critical when building medical AI pipelines
- When evaluating LLM systems for clinical use, benchmarking against human expert performance—not just inter-model comparison—is essential to establish real-world utility and safety
Abstract
Large Language Models (LLMs) hold promise for medical applications but often lack domain-specific expertise. Retrieval Augmented Generation (RAG) enables customization by integrating specialized knowledge. This study assessed the accuracy, consistency, and safety of LLM-RAG models in determining surgical fitness and delivering preoperative instructions using 35 local and 23 international guidelines. Ten LLMs (including GPT-3.5, GPT-4, GPT-4o, Gemini, Llama2, Llama3, and Claude) were tested across 14 clinical scenarios. A total of 3,234 responses were generated and compared to 448 human-generated answers. The GPT-4 LLM-RAG model with international guidelines generated answers within 20 seconds and achieved the highest accuracy, significantly better than human-generated responses (96.4% vs. 86.6%, p = 0.016). Additionally, the model exhibited no hallucinations and produced more consistent output than humans. This study underscores the potential of GPT-4-based LLM-RAG models to deliver highly accurate, efficient, and consistent preoperative assessments.