
Doc-to-LoRA: Learning to Instantly Internalize Contexts

Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, R. Lange
2026
Doc-to-LoRA (D2L) is a hypernetwork that meta-learns to perform approximate context distillation in a single forward pass, generating LoRA adapters from input documents so that LLMs can answer queries without re-processing the original context.

Problem Statement

Transformer-based LLMs suffer from quadratic attention cost over long sequences, making inference with large contexts memory-intensive and slow. While context distillation (CD) can encode information into model parameters, performing CD per-prompt at inference time is impractical due to high training costs and latency. There is no efficient mechanism to rapidly internalize arbitrary documents into model weights without retraining.
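To make the memory pressure concrete, here is a back-of-the-envelope KV-cache estimate; the model configuration values are illustrative and not taken from the paper:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: one key and one value vector per token,
    per layer, per KV head (factor of 2 covers keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class config (assumed, not from the paper):
# 32 layers, 32 KV heads, head_dim 128, fp16 storage.
gib = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=32, head_dim=128) / 2**30
print(f"{gib:.1f} GiB")  # 62.5 GiB for a single 128k-token context
```

At these (assumed) sizes, a single long context costs tens of GiB of KV-cache, which is exactly the overhead D2L sidesteps by folding the context into a small adapter instead.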

Key Novelty

  • A meta-learned hypernetwork that generates task-specific LoRA adapters from an input document in a single forward pass, enabling amortized context distillation at inference time
  • Demonstrated ability to handle sequence lengths exceeding the target LLM's native context window by more than 4x on needle-in-a-haystack tasks with near-perfect zero-shot accuracy
  • Outperforms standard per-prompt context distillation on real-world QA benchmarks while significantly reducing peak memory consumption and update latency

Evaluation Highlights

  • Near-perfect zero-shot accuracy on long-context needle-in-a-haystack tasks at sequence lengths >4x the target LLM's native context window
  • On real-world QA datasets, D2L outperforms standard context distillation with significantly lower peak memory consumption and update latency under limited compute budgets

Breakthrough Assessment

7/10. D2L introduces a compelling and practical paradigm for amortized context distillation, combining hypernetwork meta-learning with LoRA generation to solve a real inference bottleneck in LLMs. While building on known components (hypernetworks, LoRA, CD), the combination and its demonstrated scalability beyond native context windows represent a significant advance for efficient long-context inference.

Methodology

  1. Train a lightweight hypernetwork (D2L) in a meta-learning setup: given input documents paired with QA tasks, the hypernetwork learns to predict LoRA adapter weights for a frozen target LLM that minimize task loss
  2. At inference time, pass an unseen document through D2L in a single forward pass to generate a document-specific LoRA adapter, which is then merged into or applied alongside the target LLM
  3. Subsequent queries about the document are answered by the adapted LLM without re-consuming the original context, eliminating the KV-cache overhead and repeated attention over long sequences
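The three steps above can be sketched end to end in miniature. Everything here (the dimensions, the single linear hypernetwork, the random weights) is a toy stand-in for the paper's actual architecture, kept only to show the data flow: document in, LoRA factors out, adapted forward pass for queries:

```python
import random

random.seed(0)

D_MODEL, RANK, DOC_DIM = 4, 2, 3   # toy sizes, not the paper's

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Step 1 (training) would learn H via meta-learning; here it is just a random
# linear map from a pooled document embedding to flattened LoRA factors A, B.
H = rand_mat(RANK * D_MODEL * 2, DOC_DIM)

def d2l_generate(doc_embedding):
    """Step 2: one hypernetwork forward pass -> document-specific (A, B)."""
    flat = matvec(H, doc_embedding)
    A = [flat[i * D_MODEL:(i + 1) * D_MODEL] for i in range(RANK)]   # r x d
    B = [flat[RANK * D_MODEL + i * RANK:
              RANK * D_MODEL + (i + 1) * RANK] for i in range(D_MODEL)]  # d x r
    return A, B

def adapted_forward(W, A, B, x, alpha=1.0):
    """Step 3: answer queries with W + (alpha/r) * B @ A, never re-reading
    the original document."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # (B @ A) x, factor by factor
    return [b + (alpha / RANK) * l for b, l in zip(base, low_rank)]

W = rand_mat(D_MODEL, D_MODEL)          # frozen target-LLM weight (toy)
A, B = d2l_generate([0.5, -0.2, 0.1])   # pooled "document" embedding (toy)
y = adapted_forward(W, A, B, [1.0, 0.0, 0.0, 0.0])
print(len(y))  # 4
```

The key property mirrored here is that `d2l_generate` runs once per document, while `adapted_forward` can then serve arbitrarily many queries at base-model cost.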

System Components

Doc-to-LoRA Hypernetwork (D2L)

A lightweight neural network that takes a document as input and outputs LoRA adapter weight matrices for a target LLM, trained via meta-learning to approximate context distillation in one forward pass

LoRA Adapter

Low-rank weight delta matrices injected into the target LLM's layers to encode document-specific knowledge, enabling query answering without re-reading the original context
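A quick parameter count shows why the low-rank form keeps the generated adapter small; the hidden size and rank below are illustrative, not reported in the paper:

```python
def lora_params(d_in, d_out, rank):
    """Parameters in a rank-r delta B @ A versus a full d_out x d_in delta."""
    full = d_out * d_in
    low_rank = rank * d_in + d_out * rank  # A: r x d_in, B: d_out x r
    return full, low_rank

full, low = lora_params(d_in=4096, d_out=4096, rank=16)
print(full // low)  # 128: the full delta is 128x larger at this size
```

This compression is what makes it feasible for a lightweight hypernetwork to emit the whole adapter in one forward pass.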

Meta-Learning Training Objective

D2L is trained across many document-QA pairs so that the generated adapters generalize to unseen documents at inference time, amortizing the cost of per-document distillation
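In symbols (notation ours, not the paper's), with frozen target-LLM weights θ, hypernetwork h with parameters φ, and training triples of document d, query q, and answer a, the objective described above amounts to:

```latex
\min_{\phi} \; \mathbb{E}_{(d,\,q,\,a)\sim\mathcal{D}}
\Big[ \mathcal{L}\big( f_{\theta + \Delta\theta}(q),\, a \big) \Big],
\qquad \Delta\theta = h_{\phi}(d)
```

Only φ is updated; the task loss on (q, a) supervises the adapter Δθ that h produces from d, which is what lets the learned mapping generalize to unseen documents.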

Frozen Target LLM

The base language model whose parameters remain fixed; it receives the generated LoRA adapter at inference to condition its responses on the internalized document content

Results

| Metric/Benchmark | Baseline (Standard CD) | Doc-to-LoRA (D2L) | Delta |
| --- | --- | --- | --- |
| Needle-in-a-haystack accuracy (>4x context window) | Degrades / fails beyond context limit | Near-perfect zero-shot accuracy | Significant improvement |
| Real-world QA performance (limited compute) | Standard CD baseline | Outperforms standard CD | Positive |
| Peak memory consumption (inference) | High (full KV-cache for long context) | Significantly reduced | Large reduction |
| Update / adaptation latency | High (per-prompt training required) | Single forward pass | Order-of-magnitude reduction |

Key Takeaways

  • D2L enables a new inference paradigm: precompute a LoRA adapter from a document once, then serve many queries cheaply without storing or reprocessing the full context, directly reducing KV-cache memory and latency in production deployments
  • The meta-learning approach amortizes the cost of context distillation across documents, making it practical to frequently update or personalize LLMs with new knowledge without fine-tuning the base model
  • The ability to exceed the native context window by 4x suggests D2L could serve as a practical workaround for context length limitations, relevant for document understanding, RAG pipelines, and long-context reasoning applications
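The precompute-once, serve-many pattern from the first bullet can be sketched as an adapter cache keyed by document hash; `generate_adapter` here is a hypothetical placeholder for the D2L forward pass:

```python
import hashlib

class AdapterCache:
    """Generate a LoRA adapter per document once, reuse it for every query."""

    def __init__(self, generate_adapter):
        self._generate = generate_adapter  # one D2L forward pass (hypothetical)
        self._cache = {}
        self.misses = 0

    def get(self, document: str):
        key = hashlib.sha256(document.encode()).hexdigest()
        if key not in self._cache:
            self.misses += 1                      # only the first query pays
            self._cache[key] = self._generate(document)
        return self._cache[key]

# Toy stand-in: the "adapter" is just the document length.
cache = AdapterCache(generate_adapter=len)
for _ in range(100):                  # 100 queries against the same document
    adapter = cache.get("long context document ...")
print(cache.misses)  # 1: the adapter was generated once, then reused
```

In a real deployment the cached value would be the generated LoRA weights, and eviction policy would trade adapter storage against regeneration cost.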

Abstract

Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.

Generated on 2026-03-02 using Claude