Panini: Continual Learning in Token Space via Structured Memory
Problem Statement
RAG-based systems waste test-time compute by repeatedly feeding verbatim document chunks to LLMs for every query, and chunk retrieval often injects irrelevant context that leads to unsupported or hallucinated answers. Existing continual learning approaches typically require modifying model parameters, limiting flexibility and scalability. There is a need for a memory-efficient, modular approach that accumulates knowledge externally while keeping the base model fixed.
Key Novelty
- Generative Semantic Workspaces (GSW): a structured external memory representing documents as entity- and event-aware networks of QA pairs rather than verbatim text chunks (a schema sketch follows this list)
- Reasoning-grounded inference chains: at query time, Panini traverses the GSW graph to retrieve the most likely inference chains, enabling latent knowledge mining without raw document access
- Continual consolidation: the GSW accumulates and consolidates itself incrementally as new documents arrive, realizing a human-like non-parametric memory that supports open-source, fixed-model pipelines
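To make the GSW concrete, here is a minimal Python sketch of one plausible schema: entity/event nodes that carry QA pairs and undirected links between related nodes. The class and field names (`GSW`, `Node`, `QAPair`, `source_doc`) are illustrative assumptions, not the paper's actual data model.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass(frozen=True)
class QAPair:
    question: str    # e.g. "Who founded the company?"
    answer: str      # e.g. "Jane Doe"
    source_doc: str  # provenance pointer, not the verbatim text

@dataclass
class Node:
    name: str                                          # entity or event name
    kind: str                                          # "entity" or "event"
    qa_pairs: list[QAPair] = field(default_factory=list)
    neighbors: set[str] = field(default_factory=set)   # names of linked nodes

class GSW:
    """External memory: a graph of entity/event nodes carrying QA pairs."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add_node(self, name: str, kind: str) -> Node:
        return self.nodes.setdefault(name, Node(name=name, kind=kind))

    def link(self, a: str, b: str) -> None:
        """Undirected edge between two existing nodes."""
        self.nodes[a].neighbors.add(b)
        self.nodes[b].neighbors.add(a)
```

Note that only QA pairs and provenance pointers are stored; the verbatim document text is never retained, which is what drives the token savings reported below.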
Evaluation Highlights
- Panini achieves 5–7% higher average QA performance than competitive baselines across six QA benchmarks
- Panini uses 2–30x fewer answer-context tokens than RAG-based methods while also reducing unsupported answers on curated unanswerable queries
Methodology
- Write time (document ingestion): each incoming document is parsed into a Generative Semantic Workspace (GSW), an entity- and event-aware graph of QA pairs that captures the key factual and relational content of the document
- Memory consolidation: as new documents arrive, the GSW is continually updated and consolidated, merging overlapping entities/events and resolving conflicts, so the memory remains coherent without storing verbatim text
- Read time (query answering): given a user query, Panini traverses the GSW (not the raw documents) to identify the most relevant inference chains of QA pairs, and uses these compact, structured chains as context for the LLM to generate a grounded answer (a traversal sketch follows this list)
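A hedged sketch of the read-time step, reusing the `GSW`/`Node`/`QAPair` classes from the schema sketch above. The Jaccard token-overlap scorer and the beam search here are stand-ins for whatever similarity measure and traversal Panini actually uses; they only illustrate the shape of the computation.

```python
# Assumes the GSW / Node / QAPair classes from the schema sketch above.

def lexical_score(query: str, text: str) -> float:
    """Jaccard token overlap: a cheap stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(1, len(q | t))

def node_score(gsw: GSW, query: str, name: str) -> float:
    """Relevance of a node is its best-matching QA pair."""
    qa_pairs = gsw.nodes[name].qa_pairs
    return max((lexical_score(query, f"{qa.question} {qa.answer}")
                for qa in qa_pairs), default=0.0)

def best_chains(gsw: GSW, query: str, depth: int = 3, beam: int = 5):
    """Beam-search node paths (candidate inference chains) for a query."""
    seeds = sorted(gsw.nodes, key=lambda n: node_score(gsw, query, n),
                   reverse=True)[:beam]
    chains = [([name], node_score(gsw, query, name)) for name in seeds]
    for _ in range(depth - 1):
        candidates = list(chains)   # keep shorter chains in the running
        for path, score in chains:
            for nxt in gsw.nodes[path[-1]].neighbors - set(path):
                candidates.append((path + [nxt],
                                   score + node_score(gsw, query, nxt)))
        chains = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return chains

def chain_context(gsw: GSW, chain: list[str]) -> str:
    """Flatten a chain's QA pairs into compact answer context for the LLM."""
    return "\n".join(f"Q: {qa.question} A: {qa.answer}"
                     for name in chain for qa in gsw.nodes[name].qa_pairs)
```

Summing node scores along a path favors longer chains; a production system would presumably length-normalize or let the LLM rerank candidate chains.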
System Components
- Generative Semantic Workspace (GSW): an external structured memory representing documents as a network of QA pairs organized around entities and events, enabling efficient storage and retrieval of semantic content without raw text
- Write-time extraction: a process that extracts entities and events from documents and links them through QA pairs to form a queryable knowledge graph
- Inference chains: at query time, paths through the GSW graph that connect relevant QA pairs into coherent reasoning chains, providing the LLM with minimal but sufficient context for answer generation
- Memory consolidation: incrementally integrates new experiences (documents) into the existing GSW by merging, deduplicating, and resolving conflicting entries without modifying the base LLM (a consolidation sketch follows this list)
- Frozen base LLM: the language model whose parameters remain unchanged; it acts purely as a reader/reasoner over retrieved inference chains, enabling modular and open-source deployment
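One way the consolidation component could work, again building on the classes sketched earlier. The merge policy here (lowercase name normalization, exact-duplicate QA removal, first-writer-wins on node kind) is an assumption for illustration; the paper's conflict resolution is presumably richer.

```python
# Assumes the GSW / Node classes from the schema sketch above. The merge
# policy below is an illustrative assumption, not the paper's algorithm.

def canonical(name: str) -> str:
    """Normalize a node name so aliases like 'ACME  Corp' collide."""
    return " ".join(name.lower().split())

def consolidate(gsw: GSW) -> None:
    """Merge duplicate entity/event nodes and deduplicate their QA pairs."""
    merged: dict[str, Node] = {}
    for node in gsw.nodes.values():
        key = canonical(node.name)
        target = merged.setdefault(key, Node(name=key, kind=node.kind))
        seen = {(qa.question, qa.answer) for qa in target.qa_pairs}
        for qa in node.qa_pairs:
            if (qa.question, qa.answer) not in seen:   # drop exact duplicates
                target.qa_pairs.append(qa)
                seen.add((qa.question, qa.answer))
        target.neighbors |= {canonical(n) for n in node.neighbors}
    for node in merged.values():
        node.neighbors.discard(node.name)              # no self-links
    gsw.nodes = merged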
Results
| Metric/Benchmark | Best RAG Baseline | Panini (This Paper) | Delta |
|---|---|---|---|
| Average QA accuracy (6 benchmarks) | Competitive baseline | Highest average performance | +5–7% |
| Answer-context tokens per query | Baseline token budget | 2–30x fewer tokens | 2–30x reduction |
| Unsupported answers (curated unanswerable queries) | Higher unsupported-answer rate | Fewer unsupported answers | Reduction (not quantified) |
Key Takeaways
- Structuring documents as QA-pair graphs at write time — rather than storing raw chunks — is a practical strategy to simultaneously reduce inference-time compute and improve answer grounding in LLM-based QA systems
- The GSW framework is fully compatible with open-source LLMs and requires no fine-tuning of the base model, making it straightforward to integrate into existing pipelines as a drop-in replacement for RAG (an end-to-end usage sketch follows this list)
- Continual consolidation of external memory (rather than append-only chunk storage) is key to maintaining coherent, non-redundant knowledge as document streams grow, and is an underexplored design dimension for production RAG systems
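Putting the sketches together, a hypothetical end-to-end usage showing the drop-in shape of the pipeline. The data and API surface are made up for illustration; consult the released implementation at https://github.com/roychowdhuryresearch/gsw-memory for the real interface.

```python
# Reuses GSW / QAPair and the consolidate / best_chains / chain_context
# helpers sketched above; all names and documents here are made-up data.

gsw = GSW()

# Write time: attach QA pairs to entity/event nodes instead of storing chunks.
acme = gsw.add_node("Acme Corp", "entity")
acme.qa_pairs.append(QAPair("Who founded Acme Corp?", "Jane Doe", "doc-001"))
founding = gsw.add_node("acme corp founding", "event")
founding.qa_pairs.append(QAPair("When was Acme Corp founded?", "1999", "doc-002"))
gsw.link("Acme Corp", "acme corp founding")
consolidate(gsw)   # merge aliases, dedupe QA pairs

# Read time: hand the fixed LLM compact inference chains, not raw documents.
for chain, score in best_chains(gsw, "Who founded Acme Corp and when?"):
    print(f"{score:.2f}  {' -> '.join(chain)}")
    print(chain_context(gsw, chain))
```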
Abstract
Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) -- an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%-7% higher than other competitive baselines, while using 2-30x fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time -- as achieved by the GSW framework -- yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.