Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Problem Statement
Long-context video understanding in MLLMs requires balancing computational efficiency with retention of fine-grained spatio-temporal patterns, a trade-off that existing methods fail to resolve. Sparse sampling loses temporal dynamics, dense sampling at low resolution degrades spatial fidelity, and token compression discards subtle interactions. These limitations are especially pronounced in videos with complex motion or varying resolutions.
Key Novelty
- Intra-chunk Vision Encoder (IVE) using 3D convolutions combined with Vision Transformers to preserve high-resolution spatial features within video chunks
- Inter-chunk Feature Aggregator (IFA) with transformer-based dependency modeling and chunk-level rotary position encodings for cross-chunk temporal coherence
- Unified image-video understanding architecture that treats images as single-frame videos via sub-image decomposition, eliminating the need for separate processing pipelines
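The unified image-video idea above can be sketched in a few lines: a static image is cut into sub-image tiles, and each tile is wrapped as a "video" of temporal length 1 so the same chunk pipeline applies. The tile size (448) and the non-overlapping grid layout are illustrative assumptions, not specifics from the paper.

```python
import numpy as np

def decompose_image(image, tile=448):
    """Split a static image into sub-image tiles and wrap each tile as a
    single-frame video chunk (hypothetical sketch of sub-image
    decomposition; tile size and grid layout are illustrative)."""
    h, w, c = image.shape
    chunks = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            sub = image[y:y + tile, x:x + tile]
            # A leading axis of length 1 makes the tile a 1-frame "video".
            chunks.append(sub[np.newaxis, ...])  # shape (1, tile, tile, c)
    return chunks

img = np.zeros((896, 896, 3), dtype=np.float32)  # toy 896x896 image
chunks = decompose_image(img)
print(len(chunks), chunks[0].shape)
```

With these toy sizes the image yields a 2x2 grid of four tiles, each of shape `(1, 448, 448, 3)`, ready for the same encoder that consumes video chunks.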
Evaluation Highlights
- Mavors significantly outperforms existing methods on diverse benchmarks requiring fine-grained spatio-temporal reasoning, including tasks with complex motion and varying resolutions
- The framework demonstrates superiority in both spatial fidelity and temporal continuity metrics compared to sparse sampling, dense low-resolution, and token compression baselines
Methodology
- Segment input video into temporal chunks and pass each chunk through the Intra-chunk Vision Encoder (IVE), which applies 3D convolutions to capture local spatio-temporal patterns followed by a Vision Transformer to produce high-resolution latent representations per chunk
- Feed the sequence of chunk-level latent representations into the Inter-chunk Feature Aggregator (IFA), a transformer module with chunk-level rotary position encodings that models long-range temporal dependencies across the entire video
- Pass the resulting multi-granularity video representation (combining fine spatial detail from IVE and global temporal structure from IFA) to the MLLM backbone for downstream language-based reasoning and generation
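The three steps above can be sketched end-to-end as a shape-level data flow. This is a minimal stand-in, not the paper's implementation: average pooling and a random linear projection replace the real 3D convolutions and ViT in the IVE, a chunk-indexed sinusoidal signal replaces the IFA's attention with rotary encodings, and all sizes (`chunk_len=4`, 28x28 frames, `d_model=64`) are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def chunk_video(frames, chunk_len=4):
    """Split a (T, H, W, C) frame stack into temporal chunks."""
    return [frames[i:i + chunk_len] for i in range(0, len(frames), chunk_len)]

def ive(chunk, d_model=64):
    """Stand-in for the Intra-chunk Vision Encoder: 3D average pooling
    over 2x14x14 spatio-temporal patches plus a random projection
    replace the paper's 3D convolutions + ViT (illustrative only)."""
    t, h, w, c = chunk.shape
    pooled = chunk.reshape(t // 2, 2, h // 14, 14, w // 14, 14, c).mean(axis=(1, 3, 5))
    tokens = pooled.reshape(-1, c)          # (tokens_per_chunk, c)
    w_proj = rng.normal(size=(c, d_model))
    return tokens @ w_proj                  # (tokens_per_chunk, d_model)

def ifa(chunk_feats, d_model=64):
    """Stand-in for the Inter-chunk Feature Aggregator: concatenates
    chunk tokens and adds a chunk-level (not token-level) sinusoidal
    signal; the paper instead uses attention with rotary encodings."""
    ids = np.concatenate([np.full(len(f), i) for i, f in enumerate(chunk_feats)])
    x = np.concatenate(chunk_feats)         # (total_tokens, d_model)
    freq = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros_like(x)
    pe[:, 0::2] = np.sin(ids[:, None] * freq)
    pe[:, 1::2] = np.cos(ids[:, None] * freq)
    return x + pe

video = np.zeros((16, 28, 28, 3))     # 16 toy frames
chunks = chunk_video(video)           # 4 chunks of 4 frames each
feats = [ive(c) for c in chunks]      # 4 chunk-level token sets, (8, 64) each
out = ifa(feats)                      # (32, 64) sequence handed to the LLM
print(out.shape)
```

The point of the sketch is the data flow: spatial detail is encoded per chunk at full resolution first, and only then does a second stage stitch chunks together temporally before anything reaches the language model.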
System Components
- Intra-chunk Vision Encoder (IVE): encodes each video chunk using 3D convolutions to capture local motion and spatial structure, followed by a Vision Transformer to produce rich high-resolution latent features per chunk
- Inter-chunk Feature Aggregator (IFA): a transformer-based module that takes chunk-level representations and uses chunk-level rotary position encodings to model temporal coherence and long-range dependencies across the full video
- Sub-image decomposition: decomposes static images into sub-image patches and treats them as single-frame videos, enabling the same pipeline to handle both image and video inputs without architectural modifications
- Chunk-level rotary position encoding: a positional encoding scheme applied at chunk granularity in the IFA to preserve temporal ordering and relative position information across long video sequences
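To make "chunk-level" concrete, here is a minimal rotary-encoding sketch where every token's rotary phase comes from its chunk index rather than its token position, so tokens within a chunk share a phase and cross-chunk attention depends only on relative chunk distance. The base frequency and channel pairing below are conventional RoPE defaults, not the paper's exact configuration.

```python
import numpy as np

def chunk_rope(x, chunk_ids, base=10000.0):
    """Apply rotary position encoding at chunk granularity (sketch):
    rotation angles are derived from chunk indices, so all tokens in a
    chunk receive the identical rotation. Base and channel pairing are
    standard RoPE conventions, assumed here for illustration."""
    n, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)  # per-channel frequencies
    ang = np.outer(chunk_ids, inv_freq)           # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to paired channels; norms are preserved.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

tokens = np.ones((6, 8))                  # 6 tokens, hidden size 8
chunk_ids = np.array([0, 0, 1, 1, 2, 2])  # two tokens per chunk
rotated = chunk_rope(tokens, chunk_ids)
```

Because the rotation is a pure phase change, token norms are unchanged, and the two tokens of any one chunk come out identical here since they share both content and chunk index.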
Results
| Benchmark Category | Baseline Approach | Mavors | Delta |
|---|---|---|---|
| Fine-grained spatio-temporal reasoning | Sparse sampling | Significantly higher | Substantial gain |
| Complex motion understanding | Dense low-resolution sampling | Significantly higher | Substantial gain |
| Spatial detail retention | Token compression methods | Significantly higher | Substantial gain |
| Unified image understanding | Separate image pipeline | Competitive/superior | Architecture simplification |
Key Takeaways
- Chunked 3D convolutional encoding before ViT processing is an effective strategy to retain local spatio-temporal structure in long videos without sacrificing resolution, and is worth adopting when building video MLLMs
- Treating temporal aggregation as a second-stage transformer problem with chunk-level positional encodings (rather than flattening all frame tokens) is a scalable design pattern for long-video modeling that reduces sequence length pressure on the LLM backbone
- Unifying image and video pipelines via sub-image decomposition simplifies training and deployment; ML practitioners building multimodal systems should consider this design to avoid maintaining separate encoders and training regimes
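To make the sequence-length argument in the second takeaway concrete, a back-of-the-envelope comparison (all numbers are hypothetical round figures, not measurements from the paper):

```python
# Illustrative token-budget arithmetic: flattening every frame's patch
# tokens into the LLM vs. aggregating per chunk first. The frame count,
# patch grid, and chunk length below are assumed values for illustration.
frames = 128
tokens_per_frame = 576            # e.g. a 24x24 patch grid per frame
flattened = frames * tokens_per_frame

chunk_len = 8
tokens_per_chunk = 576            # aggregator emits one frame's worth per chunk
aggregated = (frames // chunk_len) * tokens_per_chunk

print(flattened, aggregated)      # 73728 vs 9216: an 8x shorter sequence
```

Under these assumptions the LLM backbone sees 8x fewer visual tokens, which is the "sequence length pressure" the chunked two-stage design relieves.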
Abstract
Long-context video understanding in Multimodal Large Language Models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose Mavors, a novel framework that introduces Multi-granularity video representation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.