Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Rex-Thinker reformulates object referring as an explicit Chain-of-Thought reasoning task, enabling grounded, verifiable, and trustworthy predictions by reasoning step-by-step over candidate objects before making final bounding box decisions.

Problem Statement

Most object referring models treat the task as direct bounding box regression, offering no interpretability and failing to reject expressions when no matching object exists in the image. This leads to hallucinated predictions and opaque decision-making that cannot be audited or trusted in real-world deployments. There is a critical gap between the model's internal reasoning and its output, making it unreliable for safety-sensitive applications.

Key Novelty

Formulation of object referring as an explicit CoT reasoning task with structured planning, action, and summarization steps per candidate object — a novel paradigm shift from direct regression
Construction of HumanRef-CoT, a large-scale GPT-4o-generated dataset providing structured reasoning traces for training grounded referring models
Two-stage training pipeline: cold-start supervised fine-tuning to acquire structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization

Evaluation Highlights

Outperforms standard baselines in precision on the in-domain HumanRef benchmark while also providing interpretable reasoning chains
Demonstrates improved ability to reject hallucinated outputs (abstention on no-match expressions) and strong out-of-domain generalization compared to direct prediction baselines

Breakthrough Assessment

7/10. Rex-Thinker is a significant advance: it bridges grounded visual reasoning and object referring, and it introduces trustworthy abstention and interpretability, properties largely absent in prior work. It nonetheless builds on established components such as CoT, GRPO, and existing referring datasets.

Methodology

Step 1 — Candidate Identification: Given a referring expression, identify all candidate object instances in the image corresponding to the referred object category
Step 2 — Step-by-step CoT Reasoning: For each candidate, perform structured reasoning (plan, act, summarize) to assess whether the candidate matches the referring expression based on visual evidence
Step 3 — Two-Stage Training: First, cold-start SFT on HumanRef-CoT to teach structured reasoning format; then GRPO-based RL fine-tuning to optimize prediction accuracy and generalization
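
The inference flow in Steps 1 and 2 can be sketched as a loop that reasons over each candidate and keeps only those judged to match. This is a minimal illustration, not the authors' implementation: the `judge` callable is a hypothetical stand-in for the model's per-candidate reasoning, and `CandidateVerdict`, `refer`, and `toy_judge` are names invented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); coordinate convention is assumed


@dataclass
class CandidateVerdict:
    box: Box
    reasoning: str  # free-text justification for this candidate
    matches: bool   # whether the candidate is judged to satisfy the expression


def refer(
    candidates: List[Box],
    expression: str,
    judge: Callable[[Box, str], Tuple[str, bool]],
) -> Tuple[List[Box], List[CandidateVerdict]]:
    """Reason over every candidate, then keep only those judged to match.

    `judge` is a hypothetical placeholder for the model's reasoning step.
    An empty list of selected boxes means the expression is rejected.
    """
    verdicts = []
    for box in candidates:
        reasoning, matches = judge(box, expression)
        verdicts.append(CandidateVerdict(box, reasoning, matches))
    selected = [v.box for v in verdicts if v.matches]
    return selected, verdicts


def toy_judge(box: Box, expression: str) -> Tuple[str, bool]:
    # Toy rule used only for demonstration: "left" means x1 < 100.
    is_left = box[0] < 100
    side = "left" if is_left else "right"
    return f"x1={box[0]} lies to the {side} of x=100", is_left


boxes = [(10, 20, 60, 200), (300, 25, 360, 210)]
selected, trace = refer(boxes, "the person on the left", toy_judge)
```

Returning the per-candidate verdicts alongside the selected boxes keeps every decision inspectable, which is the property the reasoning-first formulation is meant to provide.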

System Components

Rex-Thinker Model

A multimodal LLM-based model that performs explicit chain-of-thought reasoning over object candidates before predicting bounding boxes for object referring

HumanRef-CoT Dataset

Large-scale dataset constructed by prompting GPT-4o on HumanRef, providing structured CoT reasoning traces in planning-action-summarization format for each referring sample
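
One way to picture a planning-action-summarization sample is as a small record with a consistency check. The schema below is hypothetical: the field names, the rule of one action step per candidate, and the `validate` helper are assumptions made for illustration, not the published HumanRef-CoT format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); convention is assumed


@dataclass
class ReasoningTrace:
    planning: str        # how the expression will be checked
    actions: List[str]   # assumed: one reasoning step per candidate
    summary: str         # final conclusion over all candidates


@dataclass
class CoTSample:
    expression: str
    candidate_boxes: List[Box]
    trace: ReasoningTrace
    answer_boxes: List[Box] = field(default_factory=list)  # empty for a no-match sample


def validate(sample: CoTSample) -> List[str]:
    """Return a list of structural problems; an empty list means the sample is well formed."""
    problems = []
    if not sample.trace.planning.strip():
        problems.append("missing planning step")
    if len(sample.trace.actions) != len(sample.candidate_boxes):
        problems.append("expected one action step per candidate")
    if not sample.trace.summary.strip():
        problems.append("missing summary step")
    if any(b not in sample.candidate_boxes for b in sample.answer_boxes):
        problems.append("answer box is not among the candidates")
    return problems


sample = CoTSample(
    expression="the person holding an umbrella",
    candidate_boxes=[(10, 20, 60, 200), (300, 25, 360, 210)],
    trace=ReasoningTrace(
        planning="Check each person for an umbrella.",
        actions=["Person 1: hands are empty.", "Person 2: holds an umbrella."],
        summary="Only Person 2 matches.",
    ),
    answer_boxes=[(300, 25, 360, 210)],
)
```

A check of this kind would be useful when traces are machine-generated, since malformed traces can be filtered before supervised fine-tuning.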

Candidate Instance Identification

A preliminary stage that enumerates all object instances of the referred category in the image, providing a structured set of candidates for subsequent reasoning
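
As a rough sketch, candidate enumeration can be viewed as filtering detector outputs by the referred category. The `Detection` record, the `candidates_for` helper, and the score threshold are assumptions for illustration. How the category is extracted from the expression, and which detector is used, are not specified here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); convention is assumed


@dataclass
class Detection:
    label: str
    score: float
    box: Box


def candidates_for(
    detections: List[Detection], category: str, min_score: float = 0.3
) -> List[Box]:
    """Keep every detection of the referred category above a confidence threshold.

    The threshold value is an arbitrary illustrative default.
    """
    return [
        d.box
        for d in detections
        if d.label == category and d.score >= min_score
    ]


detections = [
    Detection("person", 0.9, (10, 20, 60, 200)),
    Detection("dog", 0.8, (100, 150, 180, 220)),
    Detection("person", 0.2, (300, 25, 360, 210)),
]
person_candidates = candidates_for(detections, "person")
```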

GRPO-based RL Training

Group Relative Policy Optimization reinforcement learning stage applied after SFT to improve prediction accuracy, reduce hallucinations, and enhance out-of-domain generalization
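
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-normalized advantage. The reward values are placeholders, since the reward design is not described in this summary, and the use of population standard deviation with a small `eps` is an implementation assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for one prompt, with placeholder rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one. When all responses in a group earn the same reward, every advantage is zero and that group contributes no learning signal.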

Abstention Mechanism

Learned capability to reject or abstain from making predictions when no object in the image satisfies the given referring expression, reducing false positive hallucinations
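
Abstention only helps if the evaluation credits an empty prediction on a no-match expression. The sketch below shows one possible scoring convention, not the benchmark's official protocol. The IoU threshold of 0.5, the greedy matching, and the choice to score a correct abstention as (1.0, 1.0) are all assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); convention is assumed


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    pred: Sequence[Box], gt: Sequence[Box], thr: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching; an empty prediction on empty ground truth is correct."""
    if not pred and not gt:
        return 1.0, 1.0  # assumed convention: correct abstention gets full credit
    unmatched = list(gt)
    true_positives = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    return precision, recall


def rejection_rate(preds_on_no_match: Sequence[Sequence[Box]]) -> float:
    """Fraction of no-match expressions for which the model returned no boxes."""
    rejected = sum(1 for p in preds_on_no_match if not p)
    return rejected / len(preds_on_no_match)
```

Under this convention, any box predicted for a no-match expression yields zero precision, so a model that never abstains is penalized on every such sample.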

Results

Metric/Benchmark	Baseline (Direct Prediction)	Rex-Thinker	Delta
In-domain Precision (HumanRef)	Lower	Higher	Improved
Hallucination Rejection Rate	Poor (no abstention)	Better abstention	Significant improvement
Out-of-domain Generalization	Weaker	Stronger	Improved
Interpretability	None (black-box bbox)	Verifiable CoT traces	Qualitative gain

Key Takeaways

Framing object referring as a CoT reasoning task rather than direct regression simultaneously improves accuracy, interpretability, and the ability to reject hallucinated predictions — a strong template for other grounded vision tasks
GRPO-based RL fine-tuning on top of a cold-start SFT stage is an effective recipe for teaching large multimodal models (LMMs) structured reasoning that generalizes out-of-domain, suggesting this two-stage paradigm may apply more broadly
Constructing structured reasoning datasets via GPT-4o distillation (HumanRef-CoT) is a scalable strategy for bootstrapping interpretable reasoning capabilities in smaller models without expensive human annotation

Abstract

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.