Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Problem Statement
Long-context video understanding in MLLMs requires balancing computational efficiency with retention of fine-grained spatio-temporal patterns, a trade-off that existing methods fail to resolve. Sparse sampling loses temporal dynamics, dense sampling at low resolution degrades spatial fidelity, and token compression discards subtle interactions. These limitations are especially pronounced in videos with complex motion or varying resolutions.
Key Novelty
- Intra-chunk Vision Encoder (IVE) using 3D convolutions combined with Vision Transformers to preserve high-resolution spatial features within video chunks
- Inter-chunk Feature Aggregator (IFA) with transformer-based dependency modeling and chunk-level rotary position encodings for cross-chunk temporal coherence
- Unified image-video understanding architecture that treats images as single-frame videos via sub-image decomposition, eliminating the need for separate processing pipelines
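The unified image-video idea above can be sketched in a few lines: a static image is cut into sub-image tiles, and each tile is wrapped as a "video" of temporal length 1 so the same chunk pipeline applies. The tile size (448) and the non-overlapping grid layout are illustrative assumptions, not specifics from the paper.

```python
import numpy as np

def decompose_image(image, tile=448):
    """Split a static image into sub-image tiles and wrap each tile as a
    single-frame video chunk (hypothetical sketch of sub-image
    decomposition; tile size and grid layout are illustrative)."""
    h, w, c = image.shape
    chunks = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            sub = image[y:y + tile, x:x + tile]
            # A leading axis of length 1 makes the tile a 1-frame "video".
            chunks.append(sub[np.newaxis, ...])  # shape (1, tile, tile, c)
    return chunks

img = np.zeros((896, 896, 3), dtype=np.float32)  # toy 896x896 image
chunks = decompose_image(img)
print(len(chunks), chunks[0].shape)
```

With these toy sizes the image yields a 2x2 grid of four tiles, each of shape `(1, 448, 448, 3)`, ready for the same encoder that consumes video chunks.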
Evaluation Highlights
- Mavors significantly outperforms existing methods on diverse benchmarks requiring fine-grained spatio-temporal reasoning, including tasks with complex motion and varying resolutions
- The framework demonstrates superiority in both spatial fidelity and temporal continuity metrics compared to sparse sampling, dense low-resolution, and token compression baselines
Methodology
- Segment input video into temporal chunks and pass each chunk through the Intra-chunk Vision Encoder (IVE), which applies 3D convolutions to capture local spatio-temporal patterns followed by a Vision Transformer to produce high-resolution latent representations per chunk
- Feed the sequence of chunk-level latent representations into the Inter-chunk Feature Aggregator (IFA), a transformer module with chunk-level rotary position encodings that models long-range temporal dependencies across the entire video
- Pass the resulting multi-granularity video representation (combining fine spatial detail from IVE and global temporal structure from IFA) to the MLLM backbone for downstream language-based reasoning and generation
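The three steps above can be sketched end-to-end as a shape-level data flow. This is a minimal stand-in, not the paper's implementation: average pooling and a random linear projection replace the real 3D convolutions and ViT in the IVE, a chunk-indexed sinusoidal signal replaces the IFA's attention with rotary encodings, and all sizes (`chunk_len=4`, 28x28 frames, `d_model=64`) are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def chunk_video(frames, chunk_len=4):
    """Split a (T, H, W, C) frame stack into temporal chunks."""
    return [frames[i:i + chunk_len] for i in range(0, len(frames), chunk_len)]

def ive(chunk, d_model=64):
    """Stand-in for the Intra-chunk Vision Encoder: 3D average pooling
    over 2x14x14 spatio-temporal patches plus a random projection
    replace the paper's 3D convolutions + ViT (illustrative only)."""
    t, h, w, c = chunk.shape
    pooled = chunk.reshape(t // 2, 2, h // 14, 14, w // 14, 14, c).mean(axis=(1, 3, 5))
    tokens = pooled.reshape(-1, c)          # (tokens_per_chunk, c)
    w_proj = rng.normal(size=(c, d_model))
    return tokens @ w_proj                  # (tokens_per_chunk, d_model)

def ifa(chunk_feats, d_model=64):
    """Stand-in for the Inter-chunk Feature Aggregator: concatenates
    chunk tokens and adds a chunk-level (not token-level) sinusoidal
    signal; the paper instead uses attention with rotary encodings."""
    ids = np.concatenate([np.full(len(f), i) for i, f in enumerate(chunk_feats)])
    x = np.concatenate(chunk_feats)         # (total_tokens, d_model)
    freq = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros_like(x)
    pe[:, 0::2] = np.sin(ids[:, None] * freq)
    pe[:, 1::2] = np.cos(ids[:, None] * freq)
    return x + pe

video = np.zeros((16, 28, 28, 3))     # 16 toy frames
chunks = chunk_video(video)           # 4 chunks of 4 frames each
feats = [ive(c) for c in chunks]      # 4 chunk-level token sets, (8, 64) each
out = ifa(feats)                      # (32, 64) sequence handed to the LLM
print(out.shape)
```

The point of the sketch is the data flow: spatial detail is encoded per chunk at full resolution first, and only then does a second stage stitch chunks together temporally before anything reaches the language model.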
System Components
- Intra-chunk Vision Encoder (IVE): encodes each video chunk using 3D convolutions to capture local motion and spatial structure, followed by a Vision Transformer to produce rich high-resolution latent features per chunk
- Inter-chunk Feature Aggregator (IFA): a transformer-based module that takes chunk-level representations and uses chunk-level rotary position encodings to model temporal coherence and long-range dependencies across the full video
- Sub-image decomposition: decomposes static images into sub-image patches and treats them as single-frame videos, enabling the same pipeline to handle both image and video inputs without architectural modifications
- Chunk-level rotary position encoding: a positional encoding scheme applied at chunk granularity in the IFA to preserve temporal ordering and relative position information across long video sequences
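To make "chunk-level" concrete, here is a minimal rotary-encoding sketch where every token's rotary phase comes from its chunk index rather than its token position, so tokens within a chunk share a phase and cross-chunk attention depends only on relative chunk distance. The base frequency and channel pairing below are conventional RoPE defaults, not the paper's exact configuration.

```python
import numpy as np

def chunk_rope(x, chunk_ids, base=10000.0):
    """Apply rotary position encoding at chunk granularity (sketch):
    rotation angles are derived from chunk indices, so all tokens in a
    chunk receive the identical rotation. Base and channel pairing are
    standard RoPE conventions, assumed here for illustration."""
    n, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)  # per-channel frequencies
    ang = np.outer(chunk_ids, inv_freq)           # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to paired channels; norms are preserved.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

tokens = np.ones((6, 8))                  # 6 tokens, hidden size 8
chunk_ids = np.array([0, 0, 1, 1, 2, 2])  # two tokens per chunk
rotated = chunk_rope(tokens, chunk_ids)
```

Because the rotation is a pure phase change, token norms are unchanged, and the two tokens of any one chunk come out identical here since they share both content and chunk index.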
Results
| Benchmark Category | Baseline Approach | Mavors | Delta |
|---|---|---|---|
| Fine-grained spatio-temporal reasoning | Sparse sampling | Significantly higher | Substantial gain |
| Complex motion understanding | Dense low-resolution sampling | Significantly higher | Substantial gain |
| Spatial detail retention | Token compression methods | Significantly higher | Substantial gain |
| Unified image understanding | Separate image pipeline | Competitive/superior | Architecture simplification |
Key Takeaways
- Chunked 3D convolutional encoding before ViT processing is an effective strategy to retain local spatio-temporal structure in long videos without sacrificing resolution, and is worth adopting when building video MLLMs
- Treating temporal aggregation as a second-stage transformer problem with chunk-level positional encodings (rather than flattening all frame tokens) is a scalable design pattern for long-video modeling that reduces sequence length pressure on the LLM backbone
- Unifying image and video pipelines via sub-image decomposition simplifies training and deployment; ML practitioners building multimodal systems should consider this design to avoid maintaining separate encoders and training regimes
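To make the sequence-length argument in the second takeaway concrete, a back-of-the-envelope comparison (all numbers are hypothetical round figures, not measurements from the paper):

```python
# Illustrative token-budget arithmetic: flattening every frame's patch
# tokens into the LLM vs. aggregating per chunk first. The frame count,
# patch grid, and chunk length below are assumed values for illustration.
frames = 128
tokens_per_frame = 576            # e.g. a 24x24 patch grid per frame
flattened = frames * tokens_per_frame

chunk_len = 8
tokens_per_chunk = 576            # aggregator emits one frame's worth per chunk
aggregated = (frames // chunk_len) * tokens_per_chunk

print(flattened, aggregated)      # 73728 vs 9216: an 8x shorter sequence
```

Under these assumptions the LLM backbone sees 8x fewer visual tokens, which is the "sequence length pressure" the chunked two-stage design relieves.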
Abstract
Long-context video understanding in Multimodal Large Language Models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose Mavors, a novel framework that introduces Multi-granularity video representation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.