Doc-to-LoRA: Learning to Instantly Internalize Contexts
Problem Statement
Transformer-based LLMs suffer from quadratic attention cost over long sequences, making inference with large contexts memory-intensive and slow. While context distillation (CD) can encode information into model parameters, performing CD per-prompt at inference time is impractical due to high training costs and latency. There is no efficient mechanism to rapidly internalize arbitrary documents into model weights without retraining.
Key Novelty
- A meta-learned hypernetwork that generates document-specific LoRA adapters from an input document in a single forward pass, enabling amortized context distillation at inference time
- Demonstrated ability to handle sequence lengths exceeding the target LLM's native context window by more than 4x on needle-in-a-haystack tasks with near-perfect zero-shot accuracy
- Outperforms standard per-prompt context distillation on real-world QA benchmarks while significantly reducing peak memory consumption and update latency
Evaluation Highlights
- Near-perfect zero-shot accuracy on long-context needle-in-a-haystack tasks at sequence lengths >4x the target LLM's native context window
- On real-world QA datasets, Doc-to-LoRA (D2L) outperforms standard context distillation with significantly lower peak memory consumption and update latency under limited compute budgets
Methodology
- Train a lightweight hypernetwork (D2L) in a meta-learning setup: given input documents paired with QA tasks, the hypernetwork learns to predict LoRA adapter weights for a frozen target LLM that minimize task loss
- At inference time, pass an unseen document through D2L in a single forward pass to generate a document-specific LoRA adapter, which is then merged into or applied alongside the target LLM
- Subsequent queries about the document are answered by the adapted LLM without re-consuming the original context, eliminating the KV-cache overhead and repeated attention over long sequences
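The inference flow above can be sketched in a few lines of numpy. This is a minimal toy illustration, not the paper's implementation: the hypernetwork is reduced to a single linear map `H`, and all dimensions (`d_model`, `rank`, `doc_dim`) are made-up toy sizes. The point is the structure: one forward pass maps a document embedding to LoRA factors, which are merged into a frozen weight so later queries never touch the document again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not from the paper).
d_model, rank, doc_dim = 16, 4, 32

# One frozen target-LLM weight matrix.
W_frozen = rng.normal(size=(d_model, d_model))

# Hypothetical hypernetwork: a single linear map from a document
# embedding to the flattened LoRA factors A (r x d) and B (d x r).
H = rng.normal(scale=0.01, size=(doc_dim, 2 * rank * d_model))

def d2l_generate_adapter(doc_embedding):
    """One forward pass: document embedding -> LoRA factors (A, B)."""
    flat = doc_embedding @ H
    A = flat[: rank * d_model].reshape(rank, d_model)
    B = flat[rank * d_model:].reshape(d_model, rank)
    return A, B

# Internalize a document once...
doc = rng.normal(size=(doc_dim,))
A, B = d2l_generate_adapter(doc)
W_adapted = W_frozen + B @ A  # merge the low-rank delta into the frozen weight

# ...then answer queries against the adapted weight, without
# re-consuming the original context or keeping its KV cache.
query_hidden = rng.normal(size=(d_model,))
out = query_hidden @ W_adapted.T
```

During meta-training, the task loss on the paired QA examples would be backpropagated into the hypernetwork parameters (`H` here) only; `W_frozen` never changes.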
System Components
- Hypernetwork (D2L): A lightweight neural network that takes a document as input and outputs LoRA adapter weight matrices for a target LLM, trained via meta-learning to approximate context distillation in one forward pass
- Generated LoRA adapters: Low-rank weight delta matrices injected into the target LLM's layers to encode document-specific knowledge, enabling query answering without re-reading the original context
- Meta-learning training regime: D2L is trained across many document-QA pairs so that the generated adapters generalize to unseen documents at inference time, amortizing the cost of per-document distillation
- Frozen target LLM: The base language model whose parameters remain fixed; it receives the generated LoRA adapter at inference to condition its responses on the internalized document content
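To see why the generated adapters are "lightweight," a back-of-envelope parameter count for a single adapted weight matrix helps. The sizes below are assumed for illustration (a 4096-wide projection at LoRA rank 16), not taken from the paper:

```python
# Illustrative sizes (assumed, not from the paper): one projection
# matrix in a mid-sized LLM, adapted at LoRA rank 16.
d_model, rank = 4096, 16

dense_delta_params = d_model * d_model   # a full dense weight update
lora_delta_params = 2 * rank * d_model   # A (r x d) plus B (d x r)

# How much smaller the low-rank delta is than a dense update.
compression = dense_delta_params / lora_delta_params
```

At these sizes the low-rank delta is 128x smaller than a dense update, which is what makes emitting adapter weights from a single hypernetwork forward pass tractable.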
Results
| Metric/Benchmark | Baseline (Standard CD) | Doc-to-LoRA (D2L) | Delta |
|---|---|---|---|
| Needle-in-Haystack Accuracy (>4x context window) | Degraded / fails beyond context limit | Near-perfect zero-shot accuracy | Significant improvement |
| Real-world QA Performance (limited compute) | Standard CD baseline | Outperforms standard CD | Positive |
| Peak Memory Consumption (inference) | High (full KV-cache for long context) | Significantly reduced | Large reduction |
| Update / Adaptation Latency | High (per-prompt training required) | Single forward pass | Order-of-magnitude reduction |
Key Takeaways
- D2L enables a new inference paradigm: precompute a LoRA adapter from a document once, then serve many queries cheaply without storing or reprocessing the full context, directly reducing KV-cache memory and latency in production deployments
- The meta-learning approach amortizes the cost of context distillation across documents, making it practical to frequently update or personalize LLMs with new knowledge without fine-tuning the base model
- The ability to exceed the native context window by 4x suggests D2L could serve as a practical workaround for context length limitations, relevant for document understanding, RAG pipelines, and long-context reasoning applications
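The memory argument in the takeaways above can be made concrete with a hedged back-of-envelope comparison: serving a long document via its KV cache versus via a per-layer rank-16 adapter. All sizes here are assumed for illustration (a 32-layer, 4096-wide model in fp16 on a 32k-token document), not figures from the paper:

```python
# Assumed sizes (illustrative, not from the paper).
n_layers, d_model, rank, seq_len = 32, 4096, 16, 32_768
bytes_per_param = 2  # fp16

# KV cache: keys and values for every token, every layer.
kv_cache_bytes = 2 * n_layers * seq_len * d_model * bytes_per_param

# LoRA adapter: A (r x d) and B (d x r) per layer, independent of
# document length.
adapter_bytes = n_layers * 2 * rank * d_model * bytes_per_param
```

Under these assumptions the KV cache is on the order of 16 GiB while the adapter is about 8 MiB, and only the former grows with document length, which is the core of the precompute-once, serve-many argument.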
Abstract
Long input sequences are central to in-context learning, document understanding, and multi-step reasoning in Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference with the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.