VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
Problem Statement
Long-context reasoning in LLMs introduces severe computational bottlenecks due to quadratic attention complexity scaling with token count. Existing compression approaches require complex additional training pipelines or external models, limiting scalability and often discarding critical fine-grained reasoning information. There is a need for a scalable, training-efficient method that preserves reasoning fidelity while drastically reducing token overhead.
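The quadratic-attention bottleneck can be made concrete with a back-of-envelope cost model. This is an illustrative sketch, not taken from the paper: because self-attention FLOPs grow with the square of sequence length, a 3.4x token reduction shrinks the attention cost of a single full pass by roughly 3.4² ≈ 11.6x.

```python
# Illustrative cost model (assumption, not from the paper): self-attention
# FLOPs scale quadratically with sequence length, so compressing tokens 3.4x
# cuts per-pass attention cost by ~3.4^2.
def attention_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for one self-attention layer: QK^T plus AV."""
    return 2 * seq_len * seq_len * d_model

full = attention_flops(10_000, 4096)
compressed = attention_flops(int(10_000 / 3.4), 4096)
print(f"attention cost ratio: {compressed / full:.3f}")  # ≈ 0.086
```

The model dimension cancels in the ratio; only the squared length reduction matters, which is why token compression pays off super-linearly for attention.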
Key Novelty
- Novel 'optical memory' paradigm: rendering intermediate reasoning text segments into images and feeding them back into VLMs, replacing verbose token sequences with compact visual representations
- 3.4x token compression achieved by converting textual reasoning traces to images using a training dataset constructed from OpenR1-Math-220K, without requiring external compression models
- Demonstrated compatibility with multiple VLM architectures (Glyph and Qwen3-VL), showing generalizability of the vision-text compression approach across different model families
Evaluation Highlights
- 2.7x end-to-end inference speedup over standard long-context reasoning baselines across the MATH500, AIME25, AMC23, and GPQA-D benchmarks
- VTC-R1 consistently matches or exceeds the accuracy of standard long-context reasoning on all tested benchmarks, despite the significant compression ratio
Methodology
- Construct a supervised fine-tuning dataset by taking OpenR1-Math-220K long-context reasoning traces, segmenting intermediate reasoning steps, and rendering those text segments into compact images to create vision-text interleaved training pairs with 3.4x fewer tokens
- Fine-tune vision-language models (Glyph and Qwen3-VL) on this dataset so they learn to read and continue reasoning from rendered image segments ('optical memory') instead of long raw token sequences
- At inference time, iteratively render completed reasoning segments into images, inject them back into the VLM context as visual tokens, and continue generation — progressively compressing the growing context window to maintain efficiency
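The token accounting behind the dataset construction can be sketched as follows. The segment size, the ~4 chars-per-text-token estimate, and the visual-token budget are illustrative assumptions chosen so the numbers land near the paper's 3.4x figure; they are not the authors' exact recipe.

```python
# Hedged sketch of building a vision-text interleaved training pair.
# Segment boundaries and per-token character budgets are assumptions.
def segment_trace(trace: str, seg_chars: int = 400) -> list[str]:
    """Split a long reasoning trace into fixed-size text segments."""
    return [trace[i:i + seg_chars] for i in range(0, len(trace), seg_chars)]

def text_token_cost(segment: str, chars_per_token: float = 4.0) -> int:
    """Assumed text-tokenizer cost: ~4 characters per token."""
    return max(1, round(len(segment) / chars_per_token))

def visual_token_cost(segment: str, chars_per_visual_token: float = 13.6) -> int:
    """Assumed cost of the rendered segment in vision patch tokens:
    ~13.6 chars per visual token, i.e. ~3.4x denser than text tokens."""
    return max(1, round(len(segment) / chars_per_visual_token))

trace = "step " * 400  # stand-in for a 2000-character reasoning trace
segs = segment_trace(trace)
text_tokens = sum(text_token_cost(s) for s in segs)
image_tokens = sum(visual_token_cost(s) for s in segs)
print(f"compression: {text_tokens / image_tokens:.1f}x")
```

Each rendered segment replaces its text-token span in the training pair, so the fine-tuned VLM learns to read prior steps from the image channel at the cheaper visual-token rate.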
System Components
- Text-to-image renderer: Converts intermediate reasoning text segments into compact rasterized images, effectively replacing verbose token sequences with dense visual representations
- Optical memory feedback: The mechanism by which rendered reasoning images are fed back into the VLM's context, allowing the model to attend to prior reasoning steps visually rather than via raw text tokens
- Vision-text interleaved training dataset: A curated dataset derived from OpenR1-Math-220K with vision-text interleaved reasoning traces achieving 3.4x token compression, used to fine-tune VLMs for this new reasoning paradigm
- Fine-tuned VLM backbones: Representative VLMs (Glyph and Qwen3-VL) adapted via supervised fine-tuning to process and generate reasoning with optical memory inputs
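The components above compose into an iterative loop at inference time. The sketch below shows only the control flow: `vlm_generate` and `render_to_image_tokens` are hypothetical stand-ins for the real Glyph/Qwen3-VL call and the renderer, and the step count and token budgets are illustrative.

```python
# Control-flow sketch of optical-memory inference. `vlm_generate` and
# `render_to_image_tokens` are stubs, not real model or renderer APIs.
def vlm_generate(context: list, max_new_tokens: int = 256) -> str:
    """Stand-in for a VLM call that continues the reasoning from context."""
    return "next reasoning segment " * 10  # dummy continuation

def render_to_image_tokens(text: str) -> dict:
    """Stand-in for rendering a text segment into an image context entry,
    assuming ~14 characters per visual token."""
    return {"type": "image", "approx_tokens": max(1, len(text) // 14)}

def reason(question: str, steps: int = 4) -> list:
    context: list = [{"type": "text", "content": question}]
    for _ in range(steps):
        segment = vlm_generate(context)
        # Replace the verbose text segment with its compact rendered image,
        # so the context grows in cheap visual tokens rather than raw text.
        context.append(render_to_image_tokens(segment))
    return context

ctx = reason("Prove that ...")
```

The key property is that each completed segment re-enters the context in compressed visual form, so the effective context cost grows far more slowly than the raw reasoning trace.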
Results
| Metric/Benchmark | Baseline (Standard Long-Context) | VTC-R1 | Delta |
|---|---|---|---|
| Token Count | 1x (baseline) | ~0.29x (3.4x compression) | -71% tokens |
| End-to-End Latency | 1x (baseline) | ~0.37x (2.7x speedup) | -63% latency |
| MATH500 Accuracy | Competitive baseline | Outperforms | Positive |
| AIME25 Accuracy | Competitive baseline | Outperforms | Positive |
| AMC23 Accuracy | Competitive baseline | Outperforms | Positive |
| GPQA-D Accuracy | Competitive baseline | Outperforms | Positive |
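The table's relative figures can be checked for internal consistency: a 3.4x compression and a 2.7x speedup, expressed as fractions of baseline, do correspond to the stated -71% and -63% deltas.

```python
# Consistency check for the table's relative numbers.
token_frac = 1 / 3.4    # ≈ 0.29x tokens, i.e. ~71% fewer
latency_frac = 1 / 2.7  # ≈ 0.37x latency, i.e. ~63% lower
print(f"tokens: {token_frac:.2f}x  latency: {latency_frac:.2f}x")
```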
Key Takeaways
- Rendering text as images is a viable and practical compression strategy for long-context LLM reasoning — ML engineers can leverage existing VLM OCR capabilities as a 'free' compression channel without building dedicated compression models
- The 2.7x latency improvement with maintained accuracy makes VTC-R1 directly relevant for production deployment of reasoning-heavy applications where inference cost is a primary constraint
- The approach's compatibility with multiple VLM families (Glyph, Qwen3-VL) suggests it can be adapted to future frontier VLMs, making it a potentially durable technique as context lengths continue to grow
Abstract
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K achieving 3.4x token compression and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.