Inference-Scaled GraphRAG: Improving Multi-Hop Question Answering on Knowledge Graphs
Problem Statement
LLMs struggle with knowledge-intensive multi-hop reasoning because they lack structured context and relational graph traversal capabilities. Standard RAG methods retrieve flat text chunks and fail to capture the relational structure between nodes in knowledge graphs. Existing GraphRAG approaches do not leverage inference-time compute scaling, leaving significant reasoning performance on the table.
Key Novelty
- Sequential inference-time scaling via deep chain-of-thought graph traversal that incrementally expands reasoning across knowledge graph hops
- Parallel inference-time scaling using majority voting over multiple independently sampled graph traversal trajectories
- An interleaved reasoning-execution loop that tightly couples LLM reasoning steps with live graph traversal actions, enabling dynamic, context-aware retrieval
Evaluation Highlights
- Significant multi-hop QA performance gains over traditional GraphRAG baselines on the GRBench benchmark
- Outperforms prior graph traversal baselines, demonstrating that inference-time scaling is architecture-agnostic and broadly applicable
Methodology
- Step 1 — Graph Grounding: Given a natural language question, identify relevant seed entities in the knowledge graph and initialize the reasoning-execution loop with those anchor nodes.
- Step 2 — Sequential Scaling (Chain-of-Thought Traversal): At each reasoning step, the LLM generates a chain of thought that decides which graph edges/relations to follow next; the execution layer fetches the corresponding neighboring nodes and facts, iterating until sufficient multi-hop context is gathered (see the sketch after this list).
- Step 3 — Parallel Scaling (Majority Voting): Multiple independent traversal trajectories are sampled in parallel, each producing a candidate answer; a majority vote aggregates these into a final, more reliable answer.
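To make Steps 1 and 2 concrete, here is a minimal sketch of the reasoning-execution loop. It is an illustration under stated assumptions, not the paper's implementation: the knowledge graph is a `networkx.MultiDiGraph` with `relation` edge attributes, `llm_complete` is a hypothetical stand-in for any chat/completion API, and the prompt formats, the `ANSWER:`/`FOLLOW:` protocol, and the `max_hops` budget are all illustrative choices. (Step 3 is sketched under System Components below.)

```python
import networkx as nx

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call (assumption): swap in any chat/completion API."""
    raise NotImplementedError

def ground_question(question: str, graph: nx.MultiDiGraph) -> list[str]:
    """Step 1 -- Graph Grounding: ask the LLM which entities anchor the question."""
    candidates = ", ".join(list(graph.nodes)[:100])  # toy candidate pool for illustration
    reply = llm_complete(
        f"Question: {question}\nKnown entities: {candidates}\n"
        "Return the entity names most relevant to the question, comma-separated:"
    )
    return [name.strip() for name in reply.split(",") if name.strip() in graph]

def traverse(question: str, graph: nx.MultiDiGraph, max_hops: int = 5) -> str:
    """Step 2 -- Sequential Scaling: interleave chain-of-thought with live graph hops."""
    frontier = ground_question(question, graph)
    evidence: list[str] = []
    for _ in range(max_hops):
        # Execution layer: expose the outgoing edges of the current frontier to the LLM.
        edges = [
            f"{u} -[{data.get('relation', 'related_to')}]-> {v}"
            for node in frontier
            for u, v, data in graph.out_edges(node, data=True)
        ]
        reply = llm_complete(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            f"Available edges: {edges}\n"
            "Think step by step, then output either 'ANSWER: <answer>' "
            "or 'FOLLOW: <node>' to keep traversing."
        )
        if "ANSWER:" in reply:
            return reply.split("ANSWER:", 1)[1].strip()
        target = reply.split("FOLLOW:", 1)[1].strip() if "FOLLOW:" in reply else ""
        if target not in graph:
            break  # the model named a node that does not exist; stop expanding
        evidence.append(f"followed edge to {target}")
        frontier = [target]
    # Fall back to answering from whatever evidence was gathered.
    return llm_complete(f"Question: {question}\nEvidence: {evidence}\nAnswer:")
```

The key property this sketch captures is the interleaving: each hop is chosen by fresh LLM reasoning over context retrieved in the previous hop, rather than from a subgraph fixed before reasoning begins.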
System Components
- Sequential Scaling Module: Extends the LLM's chain-of-thought reasoning depth by iteratively traversing knowledge graph hops, allowing the model to gather multi-hop relational evidence before answering.
- Parallel Scaling Module: Samples multiple independent reasoning-execution trajectories and applies majority voting to reduce variance and improve answer reliability (see the sketch after this list).
- Interleaved Reasoning-Execution Loop: A tight coupling mechanism in which each LLM reasoning step triggers a concrete graph query/traversal action, and the retrieved subgraph context is fed back into the next reasoning step.
- GRBench Benchmark: A multi-hop QA benchmark over knowledge graphs, used to measure performance across different graph domains and hop complexities.
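As a companion to the traversal sketch above, the following snippet shows one plausible implementation of the parallel-scaling component. It reuses the hypothetical `traverse` function from the earlier sketch; the sample count, thread-based fan-out, and lowercase normalization are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def answer_with_voting(question: str, graph, n_samples: int = 8) -> str:
    """Parallel Scaling: sample independent trajectories and majority-vote the answers."""
    # Each traverse() call should sample the LLM stochastically (temperature > 0)
    # so the trajectories genuinely differ; otherwise all votes are identical.
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        answers = list(pool.map(lambda _: traverse(question, graph), range(n_samples)))
    # Light normalization so superficially different strings vote together.
    votes = Counter(answer.strip().lower() for answer in answers)
    winner, _count = votes.most_common(1)[0]
    return winner
```

Thread-based fan-out suffices here because each trajectory is I/O-bound on LLM calls; a heavier deployment might instead batch the sampled trajectories into fewer API requests.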
Results
| Metric/Benchmark | Baseline (GraphRAG) | This Paper (IS-GraphRAG) | Delta |
|---|---|---|---|
| Multi-hop QA accuracy (GRBench) | Lower (figures not specified) | Substantially higher | Significant gain |
| Multi-hop QA vs. prior graph traversal baselines | Prior SOTA traversal methods | Outperforms | Positive improvement |
| Architecture Generality | Model-specific | Architecture-agnostic | Qualitative advantage |
Key Takeaways
- Inference-time compute scaling (both sequential depth and parallel sampling) is a practical, plug-and-play strategy to boost GraphRAG performance without retraining or fine-tuning models.
- Interleaving LLM reasoning with live graph execution steps is more effective than pre-retrieving a static subgraph, especially for complex multi-hop questions where the relevant context depends on prior reasoning steps.
- Majority voting over multiple graph traversal trajectories is a cost-effective way to reduce reasoning errors in structured QA—practitioners should consider this when deploying knowledge graph QA systems where reliability matters more than latency.
Abstract
Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling via deep chain-of-thought graph traversal with parallel scaling via majority voting over sampled trajectories, all within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs.
Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling with deep chain-of-thought graph traversal, and parallel scaling with majority voting over sampled trajectories within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs