Inference-Scaled GraphRAG: Improving Multi-Hop Question Answering on Knowledge Graphs
Problem Statement
LLMs struggle with knowledge-intensive multi-hop reasoning because they lack structured context and relational graph traversal capabilities. Standard RAG methods retrieve flat text chunks and fail to capture the relational structure between nodes in knowledge graphs. Existing GraphRAG approaches do not leverage inference-time compute scaling, leaving significant reasoning performance on the table.
Key Novelty
- Sequential inference-time scaling via deep chain-of-thought graph traversal that incrementally expands reasoning across knowledge graph hops
- Parallel inference-time scaling using majority voting over multiple independently sampled graph traversal trajectories
- An interleaved reasoning-execution loop that tightly couples LLM reasoning steps with live graph traversal actions, enabling dynamic, context-aware retrieval
Evaluation Highlights
- Significant multi-hop QA performance gains over traditional GraphRAG baselines on the GRBench benchmark
- Outperforms prior graph traversal baselines, demonstrating that inference-time scaling is architecture-agnostic and broadly applicable
Methodology
- Step 1 — Graph Grounding: Given a natural language question, identify relevant seed entities in the knowledge graph and initialize the reasoning-execution loop with those anchor nodes.
- Step 2 — Sequential Scaling (Chain-of-Thought Traversal): At each reasoning step, the LLM generates a chain of thought that decides which graph edges/relations to follow next; the execution layer fetches the corresponding neighboring nodes and facts, iterating until sufficient multi-hop context is gathered (see the sketch after this list).
- Step 3 — Parallel Scaling (Majority Voting): Multiple independent traversal trajectories are sampled in parallel, each producing a candidate answer; a majority vote aggregates these into a final, more reliable answer.
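To make Steps 1 and 2 concrete, here is a minimal sketch of the reasoning-execution loop. It is an illustration under stated assumptions, not the paper's implementation: the knowledge graph is a `networkx.MultiDiGraph` with `relation` edge attributes, `llm_complete` is a hypothetical stand-in for any chat/completion API, and the prompt formats, the `ANSWER:`/`FOLLOW:` protocol, and the `max_hops` budget are all illustrative choices. (Step 3 is sketched under System Components below.)

```python
import networkx as nx

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call (assumption): swap in any chat/completion API."""
    raise NotImplementedError

def ground_question(question: str, graph: nx.MultiDiGraph) -> list[str]:
    """Step 1 -- Graph Grounding: ask the LLM which entities anchor the question."""
    candidates = ", ".join(list(graph.nodes)[:100])  # toy candidate pool for illustration
    reply = llm_complete(
        f"Question: {question}\nKnown entities: {candidates}\n"
        "Return the entity names most relevant to the question, comma-separated:"
    )
    return [name.strip() for name in reply.split(",") if name.strip() in graph]

def traverse(question: str, graph: nx.MultiDiGraph, max_hops: int = 5) -> str:
    """Step 2 -- Sequential Scaling: interleave chain-of-thought with live graph hops."""
    frontier = ground_question(question, graph)
    evidence: list[str] = []
    for _ in range(max_hops):
        # Execution layer: expose the outgoing edges of the current frontier to the LLM.
        edges = [
            f"{u} -[{data.get('relation', 'related_to')}]-> {v}"
            for node in frontier
            for u, v, data in graph.out_edges(node, data=True)
        ]
        reply = llm_complete(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            f"Available edges: {edges}\n"
            "Think step by step, then output either 'ANSWER: <answer>' "
            "or 'FOLLOW: <node>' to keep traversing."
        )
        if "ANSWER:" in reply:
            return reply.split("ANSWER:", 1)[1].strip()
        target = reply.split("FOLLOW:", 1)[1].strip() if "FOLLOW:" in reply else ""
        if target not in graph:
            break  # the model named a node that does not exist; stop expanding
        evidence.append(f"followed edge to {target}")
        frontier = [target]
    # Fall back to answering from whatever evidence was gathered.
    return llm_complete(f"Question: {question}\nEvidence: {evidence}\nAnswer:")
```

The key property this sketch captures is the interleaving: each hop is chosen by fresh LLM reasoning over context retrieved in the previous hop, rather than from a subgraph fixed before reasoning begins.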
System Components
- Sequential Scaling Module: Extends the LLM's chain-of-thought reasoning depth by iteratively traversing knowledge graph hops, allowing the model to gather multi-hop relational evidence before answering.
- Parallel Scaling Module: Samples multiple independent reasoning-execution trajectories and applies majority voting to reduce variance and improve answer reliability (see the sketch after this list).
- Interleaved Reasoning-Execution Loop: A tight coupling mechanism in which each LLM reasoning step triggers a concrete graph query/traversal action, and the retrieved subgraph context is fed back into the next reasoning step.
- GRBench Benchmark: A multi-hop QA benchmark over knowledge graphs, used to measure performance across different graph domains and hop complexities.
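As a companion to the traversal sketch above, the following snippet shows one plausible implementation of the parallel-scaling component. It reuses the hypothetical `traverse` function from the earlier sketch; the sample count, thread-based fan-out, and lowercase normalization are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def answer_with_voting(question: str, graph, n_samples: int = 8) -> str:
    """Parallel Scaling: sample independent trajectories and majority-vote the answers."""
    # Each traverse() call should sample the LLM stochastically (temperature > 0)
    # so the trajectories genuinely differ; otherwise all votes are identical.
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        answers = list(pool.map(lambda _: traverse(question, graph), range(n_samples)))
    # Light normalization so superficially different strings vote together.
    votes = Counter(answer.strip().lower() for answer in answers)
    winner, _count = votes.most_common(1)[0]
    return winner
```

Thread-based fan-out suffices here because each trajectory is I/O-bound on LLM calls; a heavier deployment might instead batch the sampled trajectories into fewer API requests.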
Results
| Metric/Benchmark | Baseline (GraphRAG) | This Paper (IS-GraphRAG) | Delta |
|---|---|---|---|
| Multi-hop QA accuracy (GRBench) | Lower (figures not specified) | Substantially higher | Significant gain |
| Multi-hop QA vs. prior graph traversal baselines | Prior SOTA traversal methods | Outperforms | Positive improvement |
| Architecture Generality | Model-specific | Architecture-agnostic | Qualitative advantage |
Key Takeaways
- Inference-time compute scaling (both sequential depth and parallel sampling) is a practical, plug-and-play strategy to boost GraphRAG performance without retraining or fine-tuning models.
- Interleaving LLM reasoning with live graph execution steps is more effective than pre-retrieving a static subgraph, especially for complex multi-hop questions where the relevant context depends on prior reasoning steps.
- Majority voting over multiple graph traversal trajectories is a cost-effective way to reduce reasoning errors in structured QA—practitioners should consider this when deploying knowledge graph QA systems where reliability matters more than latency.
Abstract
Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling via deep chain-of-thought graph traversal with parallel scaling via majority voting over sampled trajectories, all within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs.
Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling with deep chain-of-thought graph traversal, and parallel scaling with majority voting over sampled trajectories within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs