VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
Problem Statement
Long-context reasoning in LLMs introduces severe computational bottlenecks due to quadratic attention complexity scaling with token count. Existing compression approaches require complex additional training pipelines or external models, limiting scalability and often discarding critical fine-grained reasoning information. There is a need for a scalable, training-efficient method that preserves reasoning fidelity while drastically reducing token overhead.
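The quadratic-attention bottleneck can be made concrete with a back-of-envelope cost model. This is an illustrative sketch, not taken from the paper: because self-attention FLOPs grow with the square of sequence length, a 3.4x token reduction shrinks the attention cost of a single full pass by roughly 3.4² ≈ 11.6x.

```python
# Illustrative cost model (assumption, not from the paper): self-attention
# FLOPs scale quadratically with sequence length, so compressing tokens 3.4x
# cuts per-pass attention cost by ~3.4^2.
def attention_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for one self-attention layer: QK^T plus AV."""
    return 2 * seq_len * seq_len * d_model

full = attention_flops(10_000, 4096)
compressed = attention_flops(int(10_000 / 3.4), 4096)
print(f"attention cost ratio: {compressed / full:.3f}")  # ≈ 0.086
```

The model dimension cancels in the ratio; only the squared length reduction matters, which is why token compression pays off super-linearly for attention.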
Key Novelty
- Novel 'optical memory' paradigm: rendering intermediate reasoning text segments into images and feeding them back into VLMs, replacing verbose token sequences with compact visual representations
- 3.4x token compression achieved by converting textual reasoning traces to images using a training dataset constructed from OpenR1-Math-220K, without requiring external compression models
- Demonstrated compatibility with multiple VLM architectures (Glyph and Qwen3-VL), showing generalizability of the vision-text compression approach across different model families
Evaluation Highlights
- 2.7x end-to-end inference speedup over standard long-context reasoning baselines across the MATH500, AIME25, AMC23, and GPQA-D benchmarks
- VTC-R1 consistently matches or exceeds the accuracy of standard long-context reasoning on all tested benchmarks, despite the significant compression ratio
Methodology
- Construct a supervised fine-tuning dataset by taking OpenR1-Math-220K long-context reasoning traces, segmenting intermediate reasoning steps, and rendering those text segments into compact images to create vision-text interleaved training pairs with 3.4x fewer tokens
- Fine-tune vision-language models (Glyph and Qwen3-VL) on this dataset so they learn to read and continue reasoning from rendered image segments ('optical memory') instead of long raw token sequences
- At inference time, iteratively render completed reasoning segments into images, inject them back into the VLM context as visual tokens, and continue generation — progressively compressing the growing context window to maintain efficiency
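The token accounting behind the dataset construction can be sketched as follows. The segment size, the ~4 chars-per-text-token estimate, and the visual-token budget are illustrative assumptions chosen so the numbers land near the paper's 3.4x figure; they are not the authors' exact recipe.

```python
# Hedged sketch of building a vision-text interleaved training pair.
# Segment boundaries and per-token character budgets are assumptions.
def segment_trace(trace: str, seg_chars: int = 400) -> list[str]:
    """Split a long reasoning trace into fixed-size text segments."""
    return [trace[i:i + seg_chars] for i in range(0, len(trace), seg_chars)]

def text_token_cost(segment: str, chars_per_token: float = 4.0) -> int:
    """Assumed text-tokenizer cost: ~4 characters per token."""
    return max(1, round(len(segment) / chars_per_token))

def visual_token_cost(segment: str, chars_per_visual_token: float = 13.6) -> int:
    """Assumed cost of the rendered segment in vision patch tokens:
    ~13.6 chars per visual token, i.e. ~3.4x denser than text tokens."""
    return max(1, round(len(segment) / chars_per_visual_token))

trace = "step " * 400  # stand-in for a 2000-character reasoning trace
segs = segment_trace(trace)
text_tokens = sum(text_token_cost(s) for s in segs)
image_tokens = sum(visual_token_cost(s) for s in segs)
print(f"compression: {text_tokens / image_tokens:.1f}x")
```

Each rendered segment replaces its text-token span in the training pair, so the fine-tuned VLM learns to read prior steps from the image channel at the cheaper visual-token rate.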
System Components
- Text-to-image renderer: Converts intermediate reasoning text segments into compact rasterized images, effectively replacing verbose token sequences with dense visual representations
- Optical memory feedback: The mechanism by which rendered reasoning images are fed back into the VLM's context, allowing the model to attend to prior reasoning steps visually rather than via raw text tokens
- Vision-text interleaved training dataset: A curated dataset derived from OpenR1-Math-220K with vision-text interleaved reasoning traces achieving 3.4x token compression, used to fine-tune VLMs for this new reasoning paradigm
- Fine-tuned VLM backbones: Representative VLMs (Glyph and Qwen3-VL) adapted via supervised fine-tuning to process and generate reasoning with optical memory inputs
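The components above compose into an iterative loop at inference time. The sketch below shows only the control flow: `vlm_generate` and `render_to_image_tokens` are hypothetical stand-ins for the real Glyph/Qwen3-VL call and the renderer, and the step count and token budgets are illustrative.

```python
# Control-flow sketch of optical-memory inference. `vlm_generate` and
# `render_to_image_tokens` are stubs, not real model or renderer APIs.
def vlm_generate(context: list, max_new_tokens: int = 256) -> str:
    """Stand-in for a VLM call that continues the reasoning from context."""
    return "next reasoning segment " * 10  # dummy continuation

def render_to_image_tokens(text: str) -> dict:
    """Stand-in for rendering a text segment into an image context entry,
    assuming ~14 characters per visual token."""
    return {"type": "image", "approx_tokens": max(1, len(text) // 14)}

def reason(question: str, steps: int = 4) -> list:
    context: list = [{"type": "text", "content": question}]
    for _ in range(steps):
        segment = vlm_generate(context)
        # Replace the verbose text segment with its compact rendered image,
        # so the context grows in cheap visual tokens rather than raw text.
        context.append(render_to_image_tokens(segment))
    return context

ctx = reason("Prove that ...")
```

The key property is that each completed segment re-enters the context in compressed visual form, so the effective context cost grows far more slowly than the raw reasoning trace.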
Results
| Metric/Benchmark | Baseline (Standard Long-Context) | VTC-R1 | Delta |
|---|---|---|---|
| Token Count | 1x (baseline) | ~0.29x (3.4x compression) | -71% tokens |
| End-to-End Latency | 1x (baseline) | ~0.37x (2.7x speedup) | -63% latency |
| MATH500 Accuracy | Competitive baseline | Outperforms | Positive |
| AIME25 Accuracy | Competitive baseline | Outperforms | Positive |
| AMC23 Accuracy | Competitive baseline | Outperforms | Positive |
| GPQA-D Accuracy | Competitive baseline | Outperforms | Positive |
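The table's relative figures can be checked for internal consistency: a 3.4x compression and a 2.7x speedup, expressed as fractions of baseline, do correspond to the stated -71% and -63% deltas.

```python
# Consistency check for the table's relative numbers.
token_frac = 1 / 3.4    # ≈ 0.29x tokens, i.e. ~71% fewer
latency_frac = 1 / 2.7  # ≈ 0.37x latency, i.e. ~63% lower
print(f"tokens: {token_frac:.2f}x  latency: {latency_frac:.2f}x")
```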
Key Takeaways
- Rendering text as images is a viable and practical compression strategy for long-context LLM reasoning — ML engineers can leverage existing VLM OCR capabilities as a 'free' compression channel without building dedicated compression models
- The 2.7x latency improvement with maintained accuracy makes VTC-R1 directly relevant for production deployment of reasoning-heavy applications where inference cost is a primary constraint
- The approach's compatibility with multiple VLM families (Glyph, Qwen3-VL) suggests it can be adapted to future frontier VLMs, making it a potentially durable technique as context lengths continue to grow
Abstract
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K achieving 3.4x token compression and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.