Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration
Problem Statement
Standard RAG systems struggle with multi-hop questions because a single complex query often retrieves irrelevant documents and overwhelms the reasoning capacity of the generator. Existing solutions either rely on expensive frontier LLMs for distillation or fine-tune SLMs in isolation without leveraging the feedback loop available from capable black-box LLMs. This leaves a practical gap for teams who need strong multi-hop QA performance with affordable, deployable models.
Key Novelty
- A mutual-enhancement training loop where the black-box LLM provides feedback signals to supervise and improve the SLM's sub-question decomposition without requiring distillation from frontier models like GPT-4.
- A white-box/black-box collaboration paradigm that separates concerns: the SLM handles query decomposition (trainable, transparent) while the black-box LLM handles complex reasoning over retrieved context.
- Demonstrated generalization across multiple black-box LLMs, showing the framework is not tied to a single commercial API and that a 3B SLM fine-tuned this way surpasses a frozen 32B LLM on decomposition quality.
Evaluation Highlights
- Collab-RAG outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%–14.2% on average across five multi-hop QA datasets.
- A fine-tuned 3B-parameter SLM surpasses a frozen 32B LLM in question decomposition quality, demonstrating the high parameter efficiency of the collaborative training approach.
Methodology
- The SLM decomposes an incoming complex multi-hop question into a sequence of simpler sub-questions, each of which is used to independently retrieve relevant documents from the knowledge source.
- The black-box LLM receives the sub-questions along with their retrieved contexts and generates intermediate and final answers, and also produces feedback signals (e.g., answer correctness, coherence of the decomposition) that are sent back to the SLM (see the inference sketch after this list).
- The SLM is fine-tuned using the feedback signals from the black-box LLM as supervision, iteratively improving its decomposition capability without requiring labeled decomposition data from frontier models.
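To make the inference-time loop concrete, here is a minimal sketch. The function bodies are stubs, and the `#k` placeholder convention for referencing earlier sub-answers is an illustrative assumption, not the paper's exact interface.

```python
# Minimal sketch of the Collab-RAG inference loop (illustrative; the
# function names and "#k" placeholder convention are assumptions).

def slm_decompose(question: str) -> list[str]:
    """White-box SLM: split a multi-hop question into ordered sub-questions.
    e.g., "Who is the spouse of the director of Inception?" ->
          ["Who directed Inception?", "Who is the spouse of #1?"]
    """
    ...

def retrieve(query: str, k: int = 5) -> list[str]:
    """Fetch the top-k passages for a single sub-question."""
    ...

def llm_answer(question: str, passages: list[str]) -> str:
    """Black-box LLM: answer one sub-question given its retrieved context."""
    ...

def collab_rag(question: str) -> str:
    sub_questions = slm_decompose(question)
    answers: list[str] = []
    for sq in sub_questions:
        # Resolve references to earlier hops before retrieving.
        for k, ans in enumerate(answers, start=1):
            sq = sq.replace(f"#{k}", ans)
        passages = retrieve(sq)  # retrieve per sub-question, not per full query
        answers.append(llm_answer(sq, passages))
    return answers[-1]  # the final hop's answer is the prediction
```

The key design point is that each retrieval call sees only one simple sub-question, so the retriever never has to match the full multi-hop query at once.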
System Components
- SLM decomposer: A small (e.g., 3B-parameter), fully trainable language model that breaks complex queries into ordered sub-questions to improve retrieval precision and simplify downstream reasoning.
- Retriever: A standard document retriever that fetches relevant passages for each sub-question individually, reducing noise compared to retrieving for the full complex query at once.
- Black-box LLM reader: A frozen, API-accessible large language model that synthesizes retrieved sub-question contexts into final answers and generates feedback on decomposition quality for SLM training.
- Feedback-driven training pipeline: A training pipeline that uses the black-box LLM's feedback signals as supervision to fine-tune the SLM's decomposition policy, replacing the need for expensive frontier-LLM distillation (sketched below).
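The training pipeline could look roughly like the following sketch, under the assumption that the black-box LLM's judgments are turned into preference pairs over sampled decompositions (a DPO-style setup). All helper names (`sample_decompositions`, `llm_feedback_score`, `dpo_update`) are illustrative, not the released implementation.

```python
# Sketch of the feedback-driven fine-tuning loop (assumptions: the black-box
# LLM returns a scalar feedback score, and preference optimization is used).

def sample_decompositions(slm, question: str, n: int) -> list[list[str]]:
    """Sample n candidate decompositions from the current SLM."""
    ...

def llm_feedback_score(decomposition: list[str], question: str, gold: str) -> float:
    """Run the RAG pipeline on this decomposition and let the black-box LLM
    score it (e.g., final-answer correctness plus decomposition coherence)."""
    ...

def dpo_update(slm, preference_pairs):
    """Preference-optimize the SLM on (question, chosen, rejected) triples."""
    ...

def train_decomposer(slm, questions, gold_answers, rounds: int = 3):
    for _ in range(rounds):
        pairs = []
        for q, gold in zip(questions, gold_answers):
            candidates = sample_decompositions(slm, q, n=4)
            scored = sorted(
                ((llm_feedback_score(c, q, gold), c) for c in candidates),
                key=lambda sc: sc[0],
                reverse=True,
            )
            # Keep a pair only when the feedback actually separates candidates.
            if scored[0][0] > scored[-1][0]:
                pairs.append((q, scored[0][1], scored[-1][1]))
        slm = dpo_update(slm, pairs)  # chosen vs. rejected decompositions
    return slm
```

Whatever the exact objective, the property the summary emphasizes is that the only supervision comes from the black-box LLM's feedback, with no frontier-model distillation data.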
Results
| Comparison | Baseline | Collab-RAG | Delta |
|---|---|---|---|
| Avg. over 5 multi-hop QA datasets | Best black-box-only / SLM fine-tuning baseline | Collab-RAG (3B SLM + black-box LLM) | +1.8% to +14.2% on average |
| Question decomposition quality | Frozen 32B LLM | Fine-tuned 3B SLM | 3B surpasses frozen 32B |
| Generalization across black-box LLMs | Methods tuned against a single LLM | Collab-RAG | Strong cross-LLM generalization |
Key Takeaways
- Practitioners can achieve strong multi-hop QA performance with a small (3B) fine-tuned decomposer paired with any capable black-box LLM API, avoiding the cost and access requirements of frontier model distillation.
- Separating query decomposition (white-box, trainable) from final reasoning (black-box, frozen) is an effective architectural pattern for RAG systems dealing with complex, multi-step questions.
- The framework generalizes across multiple black-box LLMs, making it a flexible drop-in improvement for teams already using commercial LLM APIs, with the open-source SLM component being the only trained artifact to maintain.
Abstract
Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. Notably, Collab-RAG relies solely on supervision from an affordable black-box LLM, without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%–14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code for Collab-RAG is available at https://github.com/ritaranx/Collab-RAG/.