Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration
Problem Statement
Standard RAG systems struggle with multi-hop questions because a single complex query often retrieves irrelevant documents and overwhelms the reasoning capacity of the generator. Existing solutions either rely on expensive frontier LLMs for distillation or fine-tune SLMs in isolation without leveraging the feedback loop available from capable black-box LLMs. This leaves a practical gap for teams who need strong multi-hop QA performance with affordable, deployable models.
Key Novelty
- A mutual-enhancement training loop where the black-box LLM provides feedback signals to supervise and improve the SLM's sub-question decomposition without requiring distillation from frontier models like GPT-4.
- A white-box/black-box collaboration paradigm that separates concerns: the SLM handles query decomposition (trainable, transparent) while the black-box LLM handles complex reasoning over retrieved context.
- Demonstrated generalization across multiple black-box LLMs, showing the framework is not tied to a single commercial API and that a 3B SLM fine-tuned this way surpasses a frozen 32B LLM on decomposition quality.
Evaluation Highlights
- Collab-RAG outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%–14.2% on average across five multi-hop QA datasets.
- A fine-tuned 3B-parameter SLM surpasses a frozen 32B LLM in question decomposition quality, demonstrating the high parameter efficiency of the collaborative training approach.
Methodology
- The SLM decomposes an incoming complex multi-hop question into a sequence of simpler sub-questions, each of which is used to independently retrieve relevant documents from the knowledge source.
- The black-box LLM receives the sub-questions along with their retrieved contexts and generates intermediate and final answers, and also produces feedback signals (e.g., answer correctness, coherence of the decomposition) that are sent back to the SLM (see the inference sketch after this list).
- The SLM is fine-tuned using the feedback signals from the black-box LLM as supervision, iteratively improving its decomposition capability without requiring labeled decomposition data from frontier models.
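To make the inference-time loop concrete, here is a minimal sketch. The function bodies are stubs, and the `#k` placeholder convention for referencing earlier sub-answers is an illustrative assumption, not the paper's exact interface.

```python
# Minimal sketch of the Collab-RAG inference loop (illustrative; the
# function names and "#k" placeholder convention are assumptions).

def slm_decompose(question: str) -> list[str]:
    """White-box SLM: split a multi-hop question into ordered sub-questions.
    e.g., "Who is the spouse of the director of Inception?" ->
          ["Who directed Inception?", "Who is the spouse of #1?"]
    """
    ...

def retrieve(query: str, k: int = 5) -> list[str]:
    """Fetch the top-k passages for a single sub-question."""
    ...

def llm_answer(question: str, passages: list[str]) -> str:
    """Black-box LLM: answer one sub-question given its retrieved context."""
    ...

def collab_rag(question: str) -> str:
    sub_questions = slm_decompose(question)
    answers: list[str] = []
    for sq in sub_questions:
        # Resolve references to earlier hops before retrieving.
        for k, ans in enumerate(answers, start=1):
            sq = sq.replace(f"#{k}", ans)
        passages = retrieve(sq)  # retrieve per sub-question, not per full query
        answers.append(llm_answer(sq, passages))
    return answers[-1]  # the final hop's answer is the prediction
```

The key design point is that each retrieval call sees only one simple sub-question, so the retriever never has to match the full multi-hop query at once.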
System Components
- SLM decomposer: A small (e.g., 3B-parameter), fully trainable language model that breaks complex queries into ordered sub-questions to improve retrieval precision and simplify downstream reasoning.
- Retriever: A standard document retriever that fetches relevant passages for each sub-question individually, reducing noise compared to retrieving for the full complex query at once.
- Black-box LLM reader: A frozen, API-accessible large language model that synthesizes retrieved sub-question contexts into final answers and generates feedback on decomposition quality for SLM training.
- Feedback-driven training pipeline: A training pipeline that uses the black-box LLM's feedback signals as supervision to fine-tune the SLM's decomposition policy, replacing the need for expensive frontier-LLM distillation (sketched below).
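The training pipeline could look roughly like the following sketch, under the assumption that the black-box LLM's judgments are turned into preference pairs over sampled decompositions (a DPO-style setup). All helper names (`sample_decompositions`, `llm_feedback_score`, `dpo_update`) are illustrative, not the released implementation.

```python
# Sketch of the feedback-driven fine-tuning loop (assumptions: the black-box
# LLM returns a scalar feedback score, and preference optimization is used).

def sample_decompositions(slm, question: str, n: int) -> list[list[str]]:
    """Sample n candidate decompositions from the current SLM."""
    ...

def llm_feedback_score(decomposition: list[str], question: str, gold: str) -> float:
    """Run the RAG pipeline on this decomposition and let the black-box LLM
    score it (e.g., final-answer correctness plus decomposition coherence)."""
    ...

def dpo_update(slm, preference_pairs):
    """Preference-optimize the SLM on (question, chosen, rejected) triples."""
    ...

def train_decomposer(slm, questions, gold_answers, rounds: int = 3):
    for _ in range(rounds):
        pairs = []
        for q, gold in zip(questions, gold_answers):
            candidates = sample_decompositions(slm, q, n=4)
            scored = sorted(
                ((llm_feedback_score(c, q, gold), c) for c in candidates),
                key=lambda sc: sc[0],
                reverse=True,
            )
            # Keep a pair only when the feedback actually separates candidates.
            if scored[0][0] > scored[-1][0]:
                pairs.append((q, scored[0][1], scored[-1][1]))
        slm = dpo_update(slm, pairs)  # chosen vs. rejected decompositions
    return slm
```

Whatever the exact objective, the property the summary emphasizes is that the only supervision comes from the black-box LLM's feedback, with no frontier-model distillation data.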
Results
| Comparison | Baseline | Collab-RAG | Delta |
|---|---|---|---|
| Avg. over 5 multi-hop QA datasets | Best black-box-only / SLM fine-tuning baseline | Collab-RAG (3B SLM + black-box LLM) | +1.8% to +14.2% on average |
| Question decomposition quality | Frozen 32B LLM | Fine-tuned 3B SLM | 3B surpasses frozen 32B |
| Generalization across black-box LLMs | Methods tuned against a single LLM | Collab-RAG | Strong cross-LLM generalization |
Key Takeaways
- Practitioners can achieve strong multi-hop QA performance with a small (3B) fine-tuned decomposer paired with any capable black-box LLM API, avoiding the cost and access requirements of frontier model distillation.
- Separating query decomposition (white-box, trainable) from final reasoning (black-box, frozen) is an effective architectural pattern for RAG systems dealing with complex, multi-step questions.
- The framework generalizes across multiple black-box LLMs, making it a flexible drop-in improvement for teams already using commercial LLM APIs, with the open-source SLM component being the only trained artifact to maintain.
Abstract
Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. Notably, Collab-RAG relies solely on supervision from an affordable black-box LLM, without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%–14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code for Collab-RAG is available at https://github.com/ritaranx/Collab-RAG/.