Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
Problem Statement
Multi-hop question answering requires integrating multiple knowledge pieces, yet the relative effectiveness of different knowledge injection mechanisms for LLMs remains poorly understood. Existing comparisons fail to adequately address scenarios where required knowledge is temporally novel (beyond pretraining cutoffs), which is increasingly relevant as LLMs are deployed on current events. This gap leaves practitioners without clear guidance on when to use fine-tuning versus retrieval-based approaches.
Key Novelty
- Construction of a new benchmark of 10,000+ multi-hop questions derived from 2024 Wikipedia events, specifically designed to probe knowledge beyond LLM pretraining cutoffs
- Systematic three-way comparison of unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and RAG across three 7B-parameter open-source LLMs on both standard and temporally novel benchmarks
- Empirical finding that unsupervised/continual pretraining provides only limited multi-hop reasoning gains, challenging its common use as a knowledge injection strategy
Evaluation Highlights
- Supervised fine-tuning achieves the highest overall accuracy across all three 7B models and both benchmarks (QASC and novel 2024 Wikipedia multi-hop dataset)
- RAG yields substantial and consistent improvements over base models—especially on temporally novel questions—while unsupervised fine-tuning shows only marginal gains over base model performance
Methodology
- Select three 7B open-source LLMs and prepare two benchmarks: QASC (standard multi-hop science QA) and a newly constructed dataset of 10,000+ multi-hop questions from 2024 Wikipedia events to test temporally novel knowledge
- Apply three knowledge injection conditions to each model: (1) unsupervised fine-tuning via continual pretraining on relevant corpora, (2) supervised fine-tuning on labeled QA pairs, and (3) RAG using retrieved context passages at inference time
- Evaluate all model-method combinations on both benchmarks using accuracy metrics, analyzing performance differences across knowledge injection strategies and knowledge novelty conditions
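The evaluation design above is a full grid: three models × three injection methods × two benchmarks, scored by accuracy. A minimal sketch of that loop, using placeholder model and benchmark names (the paper does not specify which 7B models were used, and `answer_fn` stands in for whatever inference pipeline each condition requires):

```python
from itertools import product

# Placeholder names; the paper evaluates three unspecified 7B open-source LLMs
# under three knowledge injection conditions on two benchmarks.
MODELS = ["model_a_7b", "model_b_7b", "model_c_7b"]
METHODS = ["unsupervised_ft", "supervised_ft", "rag"]
BENCHMARKS = ["qasc", "wiki_2024"]

def evaluate(predictions, gold):
    """Exact-match accuracy over parallel lists of predicted and gold answers."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)

def run_grid(answer_fn, datasets):
    """Score every (model, method, benchmark) combination.

    answer_fn(model, method, question) -> predicted answer string;
    datasets maps benchmark name -> list of (question, gold_answer) pairs.
    """
    results = {}
    for model, method, bench in product(MODELS, METHODS, BENCHMARKS):
        examples = datasets[bench]
        preds = [answer_fn(model, method, q) for q, _ in examples]
        gold = [a for _, a in examples]
        results[(model, method, bench)] = evaluate(preds, gold)
    return results
```

With three models, three methods, and two benchmarks, `run_grid` produces 18 accuracy scores, which is the comparison matrix summarized in the Results table.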
System Components
- **Unsupervised fine-tuning (continual pretraining):** Continues LLM pretraining on domain-relevant unlabeled text to inject new knowledge into model parameters without task-specific supervision
- **Supervised fine-tuning:** Fine-tunes LLMs on labeled multi-hop QA pairs with correct answers, directly optimizing for task performance and knowledge integration
- **Retrieval-augmented generation (RAG):** Retrieves relevant context passages from an external knowledge source at inference time and provides them as input to the LLM to support answer generation
- **2024 Wikipedia multi-hop benchmark:** A newly constructed dataset of 10,000+ multi-hop questions derived from Wikipedia events after typical LLM pretraining cutoffs, enabling evaluation of knowledge injection under temporal novelty constraints
- **QASC:** An existing standard multi-hop science question answering dataset used as a controlled reference point for comparing knowledge injection methods
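The RAG component follows the usual retrieve-then-prompt pattern. The paper does not name its retriever, so the sketch below uses a toy word-overlap ranker as a stand-in for any sparse or dense retrieval component; only the overall flow (retrieve top-k passages, prepend them to the question) reflects the described system:

```python
def retrieve(question, corpus, k=2):
    """Toy lexical retriever: rank passages by word overlap with the question.
    A stand-in for the unspecified retriever (e.g. BM25 or a dense encoder)."""
    q_words = set(question.lower().split())
    scored = sorted(corpus,
                    key=lambda p: -len(q_words & set(p.lower().split())))
    return scored[:k]

def rag_prompt(question, corpus, k=2):
    """Build the LLM input: retrieved context passages followed by the question."""
    passages = retrieve(question, corpus, k)
    context = "\n".join(f"Context: {p}" for p in passages)
    return f"{context}\nQuestion: {question}\nAnswer:"
```

Because the knowledge lives in the retrieved passages rather than the model weights, the same base model can answer temporally novel questions by updating the corpus alone, which is why RAG is robust to the 2024 Wikipedia setting without retraining.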
Results
| Condition | Unsupervised FT (Continual Pretraining) | RAG | Supervised FT |
|---|---|---|---|
| QASC (standard multi-hop) | Marginal gain over base | Substantial improvement | Highest accuracy |
| 2024 Wikipedia (novel knowledge) | Marginal gain over base | Substantial & consistent improvement | Highest accuracy |
| Robustness to novel knowledge | Low | High | High |
Key Takeaways
- For production systems requiring up-to-date or novel knowledge (e.g., current events), RAG is the most practical and reliable approach—it consistently improves multi-hop accuracy without the cost and rigidity of retraining
- Supervised fine-tuning is the strongest overall strategy when labeled QA data is available and the knowledge domain is stable; practitioners should prioritize it when building specialized multi-hop QA systems with fixed knowledge scopes
- Continual pretraining alone (unsupervised fine-tuning) should not be relied upon to improve multi-hop reasoning accuracy; if the goal is knowledge injection for QA tasks, it must be paired with supervised objectives or retrieval to be effective
Abstract
Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as fine-tuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.