Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
Problem Statement
Multi-hop question answering requires integrating multiple knowledge pieces, yet the relative effectiveness of different knowledge injection mechanisms for LLMs remains poorly understood. Existing comparisons fail to adequately address scenarios where required knowledge is temporally novel (beyond pretraining cutoffs), which is increasingly relevant as LLMs are deployed on current events. This gap leaves practitioners without clear guidance on when to use fine-tuning versus retrieval-based approaches.
Key Novelty
- Construction of a new benchmark of 10,000+ multi-hop questions derived from 2024 Wikipedia events, specifically designed to probe knowledge beyond LLM pretraining cutoffs
- Systematic three-way comparison of unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and RAG across three 7B-parameter open-source LLMs on both standard and temporally novel benchmarks
- Empirical finding that unsupervised/continual pretraining provides only limited multi-hop reasoning gains, challenging its common use as a knowledge injection strategy
Evaluation Highlights
- Supervised fine-tuning achieves the highest overall accuracy across all three 7B models and both benchmarks (QASC and novel 2024 Wikipedia multi-hop dataset)
- RAG yields substantial and consistent improvements over base models—especially on temporally novel questions—while unsupervised fine-tuning shows only marginal gains over base model performance
Methodology
- Select three 7B open-source LLMs and prepare two benchmarks: QASC (standard multi-hop science QA) and a newly constructed dataset of 10,000+ multi-hop questions from 2024 Wikipedia events to test temporally novel knowledge
- Apply three knowledge injection conditions to each model: (1) unsupervised fine-tuning via continual pretraining on relevant corpora, (2) supervised fine-tuning on labeled QA pairs, and (3) RAG using retrieved context passages at inference time
- Evaluate all model-method combinations on both benchmarks using accuracy metrics, analyzing performance differences across knowledge injection strategies and knowledge novelty conditions
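The evaluation design above is a full grid: three models × three injection methods × two benchmarks, scored by accuracy. A minimal sketch of that loop, using placeholder model and benchmark names (the paper does not specify which 7B models were used, and `answer_fn` stands in for whatever inference pipeline each condition requires):

```python
from itertools import product

# Placeholder names; the paper evaluates three unspecified 7B open-source LLMs
# under three knowledge injection conditions on two benchmarks.
MODELS = ["model_a_7b", "model_b_7b", "model_c_7b"]
METHODS = ["unsupervised_ft", "supervised_ft", "rag"]
BENCHMARKS = ["qasc", "wiki_2024"]

def evaluate(predictions, gold):
    """Exact-match accuracy over parallel lists of predicted and gold answers."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)

def run_grid(answer_fn, datasets):
    """Score every (model, method, benchmark) combination.

    answer_fn(model, method, question) -> predicted answer string;
    datasets maps benchmark name -> list of (question, gold_answer) pairs.
    """
    results = {}
    for model, method, bench in product(MODELS, METHODS, BENCHMARKS):
        examples = datasets[bench]
        preds = [answer_fn(model, method, q) for q, _ in examples]
        gold = [a for _, a in examples]
        results[(model, method, bench)] = evaluate(preds, gold)
    return results
```

With three models, three methods, and two benchmarks, `run_grid` produces 18 accuracy scores, which is the comparison matrix summarized in the Results table.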
System Components
- **Unsupervised fine-tuning (continual pretraining):** Continues LLM pretraining on domain-relevant unlabeled text to inject new knowledge into model parameters without task-specific supervision
- **Supervised fine-tuning:** Fine-tunes LLMs on labeled multi-hop QA pairs with correct answers, directly optimizing for task performance and knowledge integration
- **Retrieval-augmented generation (RAG):** Retrieves relevant context passages from an external knowledge source at inference time and provides them as input to the LLM to support answer generation
- **2024 Wikipedia multi-hop benchmark:** A newly constructed dataset of 10,000+ multi-hop questions derived from Wikipedia events after typical LLM pretraining cutoffs, enabling evaluation of knowledge injection under temporal novelty constraints
- **QASC:** An existing standard multi-hop science question answering dataset used as a controlled reference point for comparing knowledge injection methods
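The RAG component follows the usual retrieve-then-prompt pattern. The paper does not name its retriever, so the sketch below uses a toy word-overlap ranker as a stand-in for any sparse or dense retrieval component; only the overall flow (retrieve top-k passages, prepend them to the question) reflects the described system:

```python
def retrieve(question, corpus, k=2):
    """Toy lexical retriever: rank passages by word overlap with the question.
    A stand-in for the unspecified retriever (e.g. BM25 or a dense encoder)."""
    q_words = set(question.lower().split())
    scored = sorted(corpus,
                    key=lambda p: -len(q_words & set(p.lower().split())))
    return scored[:k]

def rag_prompt(question, corpus, k=2):
    """Build the LLM input: retrieved context passages followed by the question."""
    passages = retrieve(question, corpus, k)
    context = "\n".join(f"Context: {p}" for p in passages)
    return f"{context}\nQuestion: {question}\nAnswer:"
```

Because the knowledge lives in the retrieved passages rather than the model weights, the same base model can answer temporally novel questions by updating the corpus alone, which is why RAG is robust to the 2024 Wikipedia setting without retraining.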
Results
| Condition | Unsupervised FT (Continual Pretraining) | RAG | Supervised FT |
|---|---|---|---|
| QASC (standard multi-hop) | Marginal gain over base | Substantial improvement | Highest accuracy |
| 2024 Wikipedia (novel knowledge) | Marginal gain over base | Substantial & consistent improvement | Highest accuracy |
| Robustness to novel knowledge | Low | High | High |
Key Takeaways
- For production systems requiring up-to-date or novel knowledge (e.g., current events), RAG is the most practical and reliable approach—it consistently improves multi-hop accuracy without the cost and rigidity of retraining
- Supervised fine-tuning is the strongest overall strategy when labeled QA data is available and the knowledge domain is stable; practitioners should prioritize it when building specialized multi-hop QA systems with fixed knowledge scopes
- Continual pretraining alone (unsupervised fine-tuning) should not be relied upon to improve multi-hop reasoning accuracy; if the goal is knowledge injection for QA tasks, it must be paired with supervised objectives or retrieval to be effective
Abstract
Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as fine-tuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.