AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Problem Statement
Current VLMs applied to autonomous driving suffer from hallucinations, inefficient reasoning chains, and lack of real-world validation mechanisms, making them unreliable for safety-critical tasks. Existing approaches lack structured tool usage during reasoning, limiting their ability to ground decisions in verifiable, multi-step evidence. There is also no standardized protocol to evaluate how well models invoke and utilize external tools during driving tasks.
Key Novelty
- Autonomous driving tool library with structured, self-verified reasoning data generation that explicitly incorporates tool usage across diverse driving scenarios
- Two-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to teach VLMs autonomous tool invocation
- Agent-style Tool-Usage Evaluation protocol introducing a novel multi-tool assessment framework to rigorously benchmark tool invocation quality and utilization
Evaluation Highlights
- 53.91% improvement in overall reasoning scores on the DriveLMM-o1 benchmark over baseline
- 33.54% improvement in answer accuracy on DriveLMM-o1, with strong zero-shot and few-shot generalization across multiple benchmarks
Methodology
- Step 1 - Structured Data Generation: Build an autonomous driving tool library and automatically construct self-verified, tool-augmented reasoning datasets covering diverse driving scenarios
- Step 2 - Two-Stage Training: First apply Supervised Fine-Tuning (SFT) to teach the VLM structured tool-calling behavior, then apply Group Relative Policy Optimization (GRPO) to reinforce accurate and consistent autonomous tool invocation
- Step 3 - Agent-style Evaluation: Assess the trained model using a multi-tool evaluation protocol on DriveLMM-o1 and additional benchmarks, measuring both reasoning quality and tool usage accuracy
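The group-relative reward signal used in Step 2 can be sketched as follows. This is a minimal illustration of GRPO's advantage computation, not the paper's implementation; the function name `grpo_advantages` and the reward values are hypothetical, and the paper's actual reward design (e.g., how answer correctness and tool-call validity are weighted) is not specified in this summary.

```python
import statistics

def grpo_advantages(group_rewards):
    """Compute group-relative advantages: each sampled response is scored
    against the mean and std of rewards within its own sampling group,
    removing the need for a separate value (critic) network."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]

# Example: rewards for four responses sampled for the same driving prompt
# (values hypothetical). Responses above the group mean receive positive
# advantage and are reinforced; those below are suppressed.
rewards = [1.0, 0.2, 0.8, 0.0]
advs = grpo_advantages(rewards)
```

Because advantages are normalized within each sampling group, the update signal stays well-scaled even when absolute reward magnitudes vary across prompts.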
System Components
- Driving tool library: a curated collection of driving-relevant tools (e.g., object detection, depth estimation, scene parsing) used to ground reasoning steps in verifiable external outputs
- Structured data generator: automatically constructs self-verified, tool-augmented chain-of-thought training data by simulating tool calls and validating their outputs across diverse driving scenarios
- Two-stage training pipeline: SFT instills the structured tool-calling format, after which GRPO optimizes the policy for correct, consistent tool invocation via group-relative reward signals
- Agent-style evaluation protocol: a multi-tool assessment framework that evaluates not only final-answer accuracy but also the correctness and relevance of intermediate tool invocations during reasoning
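The self-verified data-generation component might be sketched like this. The `TOOLS` registry, the stubbed tool outputs, and the consistency rule are illustrative assumptions for the sketch, not the paper's actual tool library or verification logic.

```python
# Hypothetical tool registry with stubbed outputs; a real pipeline would
# call actual perception models on the input frame.
TOOLS = {
    "object_detection": lambda image: [
        {"label": "pedestrian", "bbox": [120, 80, 40, 90]},
    ],
    "depth_estimation": lambda image: {"pedestrian": 12.4},  # metres (stub)
}

def generate_tool_augmented_sample(image, question):
    """Simulate tool calls, record their outputs as grounded reasoning
    steps, and keep the sample only if the steps are self-consistent."""
    steps = []
    detections = TOOLS["object_detection"](image)
    steps.append({"tool": "object_detection", "output": detections})
    depths = TOOLS["depth_estimation"](image)
    steps.append({"tool": "depth_estimation", "output": depths})
    # Self-verification: every detected object must have a depth estimate,
    # otherwise the sample is discarded rather than trained on.
    if not all(d["label"] in depths for d in detections):
        return None
    return {"question": question, "reasoning_steps": steps,
            "answer": "yield to the pedestrian ahead"}

sample = generate_tool_augmented_sample(
    "frame_001.png", "What should the ego vehicle do?")
```

Discarding samples whose simulated tool outputs disagree is one simple way to keep hallucinated reasoning chains out of the training set.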
Results
| Metric/Benchmark | Baseline VLM | AgentThink |
|---|---|---|
| Overall Reasoning Score (DriveLMM-o1) | reference | +53.91% |
| Answer Accuracy (DriveLMM-o1) | reference | +33.54% |
| Zero-shot Generalization (multiple benchmarks) | Limited | Strong (qualitative gain) |
| Few-shot Generalization (multiple benchmarks) | Limited | Strong (qualitative gain) |
Key Takeaways
- Integrating tool invocation directly into CoT reasoning pipelines (rather than post-hoc) substantially reduces hallucinations and improves grounding in VLMs for safety-critical applications like autonomous driving
- Combining SFT for behavior cloning of tool-use patterns with GRPO for policy optimization is an effective recipe for teaching LLMs/VLMs structured agentic behavior without requiring massive manually annotated datasets
- Designing task-specific evaluation protocols that measure tool invocation quality—not just final output accuracy—is essential for validating agentic AI systems and should be adopted more broadly in multi-modal agent research
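A tool-invocation metric of the kind the last takeaway advocates could look like the following F1-style sketch. The function `tool_usage_score` and its set-based matching are hypothetical simplifications; the paper's protocol also assesses how tool outputs are utilized within the reasoning chain, which is not modeled here.

```python
def tool_usage_score(predicted_calls, reference_calls):
    """F1 over the sets of tools invoked in a reasoning trace: rewards
    calling the right tools (recall) without spurious calls (precision)."""
    pred, ref = set(predicted_calls), set(reference_calls)
    if not pred and not ref:
        return 1.0  # correctly invoked no tools
    precision = len(pred & ref) / len(pred) if pred else 0.0
    recall = len(pred & ref) / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Model invoked two of the three reference tools with no spurious calls.
score = tool_usage_score(
    ["object_detection", "depth_estimation"],
    ["object_detection", "depth_estimation", "scene_parsing"],
)
```

Reporting such a score alongside final-answer accuracy makes it visible when a model reaches correct answers through ungrounded reasoning.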
Abstract
Vision-Language Models (VLMs) show promise for autonomous driving, yet they struggle with hallucinations, inefficient reasoning, and limited real-world validation, which hinder accurate perception and robust step-by-step reasoning. To overcome this, we introduce **AgentThink**, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: **(i) Structured Data Generation**, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data that explicitly incorporates tool usage across diverse driving scenarios; **(ii) A Two-Stage Training Pipeline**, which employs Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and **(iii) Agent-Style Tool-Usage Evaluation**, which introduces a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink boosts overall reasoning scores by **53.91%** and answer accuracy by **33.54%**, while markedly improving reasoning quality and consistency. Ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks further underscore its capabilities. These findings highlight a promising trajectory for developing trustworthy, tool-aware autonomous driving models. Code is available at https://github.com/curryqka/AgentThink.