Panini: Continual Learning in Token Space via Structured Memory
Problem Statement
RAG-based systems waste test-time compute by repeatedly feeding verbatim document chunks to LLMs for every query, and chunk retrieval often injects irrelevant context that leads to unsupported or hallucinated answers. Existing continual learning approaches typically require modifying model parameters, limiting flexibility and scalability. There is a need for a memory-efficient, modular approach that accumulates knowledge externally while keeping the base model fixed.
Key Novelty
- Generative Semantic Workspaces (GSW): a structured external memory representing documents as entity- and event-aware networks of QA pairs rather than verbatim text chunks (a schema sketch follows this list)
- Reasoning-grounded inference chains: at query time, Panini traverses the GSW graph to retrieve the most likely inference chains, enabling latent knowledge mining without raw document access
- Continual consolidation: the GSW accumulates and consolidates itself incrementally as new documents arrive, realizing a human-like non-parametric memory that supports open-source, fixed-model pipelines
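To make the GSW concrete, here is a minimal Python sketch of one plausible schema: entity/event nodes that carry QA pairs and undirected links between related nodes. The class and field names (`GSW`, `Node`, `QAPair`, `source_doc`) are illustrative assumptions, not the paper's actual data model.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass(frozen=True)
class QAPair:
    question: str    # e.g. "Who founded the company?"
    answer: str      # e.g. "Jane Doe"
    source_doc: str  # provenance pointer, not the verbatim text

@dataclass
class Node:
    name: str                                          # entity or event name
    kind: str                                          # "entity" or "event"
    qa_pairs: list[QAPair] = field(default_factory=list)
    neighbors: set[str] = field(default_factory=set)   # names of linked nodes

class GSW:
    """External memory: a graph of entity/event nodes carrying QA pairs."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add_node(self, name: str, kind: str) -> Node:
        return self.nodes.setdefault(name, Node(name=name, kind=kind))

    def link(self, a: str, b: str) -> None:
        """Undirected edge between two existing nodes."""
        self.nodes[a].neighbors.add(b)
        self.nodes[b].neighbors.add(a)
```

Note that only QA pairs and provenance pointers are stored; the verbatim document text is never retained, which is what drives the token savings reported below.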
Evaluation Highlights
- Panini achieves 5–7% higher average QA performance than competitive baselines across six QA benchmarks
- Panini uses 2–30x fewer answer-context tokens than RAG-based methods while also reducing unsupported answers on curated unanswerable queries
Methodology
- Write time (document ingestion): each incoming document is parsed into a Generative Semantic Workspace (GSW), an entity- and event-aware graph of QA pairs that captures the key factual and relational content of the document
- Memory consolidation: as new documents arrive, the GSW is continually updated and consolidated, merging overlapping entities/events and resolving conflicts, so the memory remains coherent without storing verbatim text
- Read time (query answering): given a user query, Panini traverses the GSW (not the raw documents) to identify the most relevant inference chains of QA pairs, and uses these compact, structured chains as context for the LLM to generate a grounded answer (a traversal sketch follows this list)
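A hedged sketch of the read-time step, reusing the `GSW`/`Node`/`QAPair` classes from the schema sketch above. The Jaccard token-overlap scorer and the beam search here are stand-ins for whatever similarity measure and traversal Panini actually uses; they only illustrate the shape of the computation.

```python
# Assumes the GSW / Node / QAPair classes from the schema sketch above.

def lexical_score(query: str, text: str) -> float:
    """Jaccard token overlap: a cheap stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(1, len(q | t))

def node_score(gsw: GSW, query: str, name: str) -> float:
    """Relevance of a node is its best-matching QA pair."""
    qa_pairs = gsw.nodes[name].qa_pairs
    return max((lexical_score(query, f"{qa.question} {qa.answer}")
                for qa in qa_pairs), default=0.0)

def best_chains(gsw: GSW, query: str, depth: int = 3, beam: int = 5):
    """Beam-search node paths (candidate inference chains) for a query."""
    seeds = sorted(gsw.nodes, key=lambda n: node_score(gsw, query, n),
                   reverse=True)[:beam]
    chains = [([name], node_score(gsw, query, name)) for name in seeds]
    for _ in range(depth - 1):
        candidates = list(chains)   # keep shorter chains in the running
        for path, score in chains:
            for nxt in gsw.nodes[path[-1]].neighbors - set(path):
                candidates.append((path + [nxt],
                                   score + node_score(gsw, query, nxt)))
        chains = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return chains

def chain_context(gsw: GSW, chain: list[str]) -> str:
    """Flatten a chain's QA pairs into compact answer context for the LLM."""
    return "\n".join(f"Q: {qa.question} A: {qa.answer}"
                     for name in chain for qa in gsw.nodes[name].qa_pairs)
```

Summing node scores along a path favors longer chains; a production system would presumably length-normalize or let the LLM rerank candidate chains.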
System Components
- Generative Semantic Workspace (GSW): an external structured memory representing documents as a network of QA pairs organized around entities and events, enabling efficient storage and retrieval of semantic content without raw text
- Write-time extraction: a process that extracts entities and events from documents and links them through QA pairs to form a queryable knowledge graph
- Inference chains: at query time, paths through the GSW graph that connect relevant QA pairs into coherent reasoning chains, providing the LLM with minimal but sufficient context for answer generation
- Memory consolidation: incrementally integrates new experiences (documents) into the existing GSW by merging, deduplicating, and resolving conflicting entries without modifying the base LLM (a consolidation sketch follows this list)
- Frozen base LLM: the language model whose parameters remain unchanged; it acts purely as a reader/reasoner over retrieved inference chains, enabling modular and open-source deployment
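One way the consolidation component could work, again building on the classes sketched earlier. The merge policy here (lowercase name normalization, exact-duplicate QA removal, first-writer-wins on node kind) is an assumption for illustration; the paper's conflict resolution is presumably richer.

```python
# Assumes the GSW / Node classes from the schema sketch above. The merge
# policy below is an illustrative assumption, not the paper's algorithm.

def canonical(name: str) -> str:
    """Normalize a node name so aliases like 'ACME  Corp' collide."""
    return " ".join(name.lower().split())

def consolidate(gsw: GSW) -> None:
    """Merge duplicate entity/event nodes and deduplicate their QA pairs."""
    merged: dict[str, Node] = {}
    for node in gsw.nodes.values():
        key = canonical(node.name)
        target = merged.setdefault(key, Node(name=key, kind=node.kind))
        seen = {(qa.question, qa.answer) for qa in target.qa_pairs}
        for qa in node.qa_pairs:
            if (qa.question, qa.answer) not in seen:   # drop exact duplicates
                target.qa_pairs.append(qa)
                seen.add((qa.question, qa.answer))
        target.neighbors |= {canonical(n) for n in node.neighbors}
    for node in merged.values():
        node.neighbors.discard(node.name)              # no self-links
    gsw.nodes = merged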
Results
| Metric/Benchmark | Best RAG Baseline | Panini (This Paper) | Delta |
|---|---|---|---|
| Average QA accuracy (6 benchmarks) | Competitive baseline | Highest average performance | +5–7% |
| Answer-context tokens per query | Baseline token budget | 2–30x fewer tokens | 2–30x reduction |
| Unsupported answers (curated unanswerable queries) | Higher unsupported-answer rate | Fewer unsupported answers | Reduction (not quantified) |
Key Takeaways
- Structuring documents as QA-pair graphs at write time — rather than storing raw chunks — is a practical strategy to simultaneously reduce inference-time compute and improve answer grounding in LLM-based QA systems
- The GSW framework is fully compatible with open-source LLMs and requires no fine-tuning of the base model, making it straightforward to integrate into existing pipelines as a drop-in replacement for RAG (an end-to-end usage sketch follows this list)
- Continual consolidation of external memory (rather than append-only chunk storage) is key to maintaining coherent, non-redundant knowledge as document streams grow, and is an underexplored design dimension for production RAG systems
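Putting the sketches together, a hypothetical end-to-end usage showing the drop-in shape of the pipeline. The data and API surface are made up for illustration; consult the released implementation at https://github.com/roychowdhuryresearch/gsw-memory for the real interface.

```python
# Reuses GSW / QAPair and the consolidate / best_chains / chain_context
# helpers sketched above; all names and documents here are made-up data.

gsw = GSW()

# Write time: attach QA pairs to entity/event nodes instead of storing chunks.
acme = gsw.add_node("Acme Corp", "entity")
acme.qa_pairs.append(QAPair("Who founded Acme Corp?", "Jane Doe", "doc-001"))
founding = gsw.add_node("acme corp founding", "event")
founding.qa_pairs.append(QAPair("When was Acme Corp founded?", "1999", "doc-002"))
gsw.link("Acme Corp", "acme corp founding")
consolidate(gsw)   # merge aliases, dedupe QA pairs

# Read time: hand the fixed LLM compact inference chains, not raw documents.
for chain, score in best_chains(gsw, "Who founded Acme Corp and when?"):
    print(f"{score:.2f}  {' -> '.join(chain)}")
    print(chain_context(gsw, chain))
```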
Abstract
Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) -- an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%-7% higher than other competitive baselines, while using 2-30x fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time -- as achieved by the GSW framework -- yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.