Doc-to-LoRA: Learning to Instantly Internalize Contexts
Problem Statement
Transformer-based LLMs suffer from quadratic attention cost over long sequences, making inference with large contexts memory-intensive and slow. While context distillation (CD) can encode information into model parameters, performing CD per-prompt at inference time is impractical due to high training costs and latency. There is no efficient mechanism to rapidly internalize arbitrary documents into model weights without retraining.
Key Novelty
- A meta-learned hypernetwork that generates document-specific LoRA adapters from an input document in a single forward pass, enabling amortized context distillation at inference time
- Demonstrated ability to handle sequence lengths exceeding the target LLM's native context window by more than 4x on needle-in-a-haystack tasks with near-perfect zero-shot accuracy
- Outperforms standard per-prompt context distillation on real-world QA benchmarks while significantly reducing peak memory consumption and update latency
Evaluation Highlights
- Near-perfect zero-shot accuracy on long-context needle-in-a-haystack tasks at sequence lengths >4x the target LLM's native context window
- On real-world QA datasets, Doc-to-LoRA (D2L) outperforms standard context distillation with significantly lower peak memory consumption and update latency under limited compute budgets
Methodology
- Train a lightweight hypernetwork (D2L) in a meta-learning setup: given input documents paired with QA tasks, the hypernetwork learns to predict LoRA adapter weights for a frozen target LLM that minimize task loss
- At inference time, pass an unseen document through D2L in a single forward pass to generate a document-specific LoRA adapter, which is then merged into or applied alongside the target LLM
- Subsequent queries about the document are answered by the adapted LLM without re-consuming the original context, eliminating the KV-cache overhead and repeated attention over long sequences
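The inference flow above can be sketched in a few lines of numpy. This is a minimal toy illustration, not the paper's implementation: the hypernetwork is reduced to a single linear map `H`, and all dimensions (`d_model`, `rank`, `doc_dim`) are made-up toy sizes. The point is the structure: one forward pass maps a document embedding to LoRA factors, which are merged into a frozen weight so later queries never touch the document again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not from the paper).
d_model, rank, doc_dim = 16, 4, 32

# One frozen target-LLM weight matrix.
W_frozen = rng.normal(size=(d_model, d_model))

# Hypothetical hypernetwork: a single linear map from a document
# embedding to the flattened LoRA factors A (r x d) and B (d x r).
H = rng.normal(scale=0.01, size=(doc_dim, 2 * rank * d_model))

def d2l_generate_adapter(doc_embedding):
    """One forward pass: document embedding -> LoRA factors (A, B)."""
    flat = doc_embedding @ H
    A = flat[: rank * d_model].reshape(rank, d_model)
    B = flat[rank * d_model:].reshape(d_model, rank)
    return A, B

# Internalize a document once...
doc = rng.normal(size=(doc_dim,))
A, B = d2l_generate_adapter(doc)
W_adapted = W_frozen + B @ A  # merge the low-rank delta into the frozen weight

# ...then answer queries against the adapted weight, without
# re-consuming the original context or keeping its KV cache.
query_hidden = rng.normal(size=(d_model,))
out = query_hidden @ W_adapted.T
```

During meta-training, the task loss on the paired QA examples would be backpropagated into the hypernetwork parameters (`H` here) only; `W_frozen` never changes.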
System Components
- Hypernetwork (D2L): A lightweight neural network that takes a document as input and outputs LoRA adapter weight matrices for a target LLM, trained via meta-learning to approximate context distillation in one forward pass
- Generated LoRA adapters: Low-rank weight delta matrices injected into the target LLM's layers to encode document-specific knowledge, enabling query answering without re-reading the original context
- Meta-learning training regime: D2L is trained across many document-QA pairs so that the generated adapters generalize to unseen documents at inference time, amortizing the cost of per-document distillation
- Frozen target LLM: The base language model whose parameters remain fixed; it receives the generated LoRA adapter at inference to condition its responses on the internalized document content
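To see why the generated adapters are "lightweight," a back-of-envelope parameter count for a single adapted weight matrix helps. The sizes below are assumed for illustration (a 4096-wide projection at LoRA rank 16), not taken from the paper:

```python
# Illustrative sizes (assumed, not from the paper): one projection
# matrix in a mid-sized LLM, adapted at LoRA rank 16.
d_model, rank = 4096, 16

dense_delta_params = d_model * d_model   # a full dense weight update
lora_delta_params = 2 * rank * d_model   # A (r x d) plus B (d x r)

# How much smaller the low-rank delta is than a dense update.
compression = dense_delta_params / lora_delta_params
```

At these sizes the low-rank delta is 128x smaller than a dense update, which is what makes emitting adapter weights from a single hypernetwork forward pass tractable.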
Results
| Metric/Benchmark | Baseline (Standard CD) | Doc-to-LoRA (D2L) | Delta |
|---|---|---|---|
| Needle-in-Haystack Accuracy (>4x context window) | Degraded / fails beyond context limit | Near-perfect zero-shot accuracy | Significant improvement |
| Real-world QA Performance (limited compute) | Standard CD baseline | Outperforms standard CD | Positive |
| Peak Memory Consumption (inference) | High (full KV-cache for long context) | Significantly reduced | Large reduction |
| Update / Adaptation Latency | High (per-prompt training required) | Single forward pass | Order-of-magnitude reduction |
Key Takeaways
- D2L enables a new inference paradigm: precompute a LoRA adapter from a document once, then serve many queries cheaply without storing or reprocessing the full context, directly reducing KV-cache memory and latency in production deployments
- The meta-learning approach amortizes the cost of context distillation across documents, making it practical to frequently update or personalize LLMs with new knowledge without fine-tuning the base model
- The ability to exceed the native context window by 4x suggests D2L could serve as a practical workaround for context length limitations, relevant for document understanding, RAG pipelines, and long-context reasoning applications
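The memory argument in the takeaways above can be made concrete with a hedged back-of-envelope comparison: serving a long document via its KV cache versus via a per-layer rank-16 adapter. All sizes here are assumed for illustration (a 32-layer, 4096-wide model in fp16 on a 32k-token document), not figures from the paper:

```python
# Assumed sizes (illustrative, not from the paper).
n_layers, d_model, rank, seq_len = 32, 4096, 16, 32_768
bytes_per_param = 2  # fp16

# KV cache: keys and values for every token, every layer.
kv_cache_bytes = 2 * n_layers * seq_len * d_model * bytes_per_param

# LoRA adapter: A (r x d) and B (d x r) per layer, independent of
# document length.
adapter_bytes = n_layers * 2 * rank * d_model * bytes_per_param
```

Under these assumptions the KV cache is on the order of 16 GiB while the adapter is about 8 MiB, and only the former grows with document length, which is the core of the precompute-once, serve-many argument.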
Abstract
Long input sequences are central to in-context learning, document understanding, and multi-step reasoning in Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference with the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.