AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Problem Statement
Current VLMs applied to autonomous driving suffer from hallucinations, inefficient reasoning chains, and lack of real-world validation mechanisms, making them unreliable for safety-critical tasks. Existing approaches lack structured tool usage during reasoning, limiting their ability to ground decisions in verifiable, multi-step evidence. There is also no standardized protocol to evaluate how well models invoke and utilize external tools during driving tasks.
Key Novelty
- Autonomous driving tool library with structured, self-verified reasoning data generation that explicitly incorporates tool usage across diverse driving scenarios
- Two-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to teach VLMs autonomous tool invocation
- Agent-style Tool-Usage Evaluation protocol introducing a novel multi-tool assessment framework to rigorously benchmark tool invocation quality and utilization
Evaluation Highlights
- 53.91% improvement in overall reasoning scores on the DriveLMM-o1 benchmark over baseline
- 33.54% improvement in answer accuracy on DriveLMM-o1, with strong zero-shot and few-shot generalization across multiple benchmarks
Methodology
- Step 1 - Structured Data Generation: Build an autonomous driving tool library and automatically construct self-verified, tool-augmented reasoning datasets covering diverse driving scenarios
- Step 2 - Two-Stage Training: First apply Supervised Fine-Tuning (SFT) to teach the VLM structured tool-calling behavior, then apply Group Relative Policy Optimization (GRPO) to reinforce accurate and consistent autonomous tool invocation
- Step 3 - Agent-style Evaluation: Assess the trained model using a multi-tool evaluation protocol on DriveLMM-o1 and additional benchmarks, measuring both reasoning quality and tool usage accuracy
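The group-relative reward signal used in Step 2 can be sketched as follows. This is a minimal illustration of GRPO's advantage computation, not the paper's implementation; the function name `grpo_advantages` and the reward values are hypothetical, and the paper's actual reward design (e.g., how answer correctness and tool-call validity are weighted) is not specified in this summary.

```python
import statistics

def grpo_advantages(group_rewards):
    """Compute group-relative advantages: each sampled response is scored
    against the mean and std of rewards within its own sampling group,
    removing the need for a separate value (critic) network."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]

# Example: rewards for four responses sampled for the same driving prompt
# (values hypothetical). Responses above the group mean receive positive
# advantage and are reinforced; those below are suppressed.
rewards = [1.0, 0.2, 0.8, 0.0]
advs = grpo_advantages(rewards)
```

Because advantages are normalized within each sampling group, the update signal stays well-scaled even when absolute reward magnitudes vary across prompts.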
System Components
- Driving tool library: a curated collection of driving-relevant tools (e.g., object detection, depth estimation, scene parsing) used to ground reasoning steps in verifiable external outputs
- Structured data generator: automatically constructs self-verified, tool-augmented chain-of-thought training data by simulating tool calls and validating their outputs across diverse driving scenarios
- Two-stage training pipeline: SFT instills the structured tool-calling format, after which GRPO optimizes the policy for correct, consistent tool invocation via group-relative reward signals
- Agent-style evaluation protocol: a multi-tool assessment framework that evaluates not only final-answer accuracy but also the correctness and relevance of intermediate tool invocations during reasoning
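The self-verified data-generation component might be sketched like this. The `TOOLS` registry, the stubbed tool outputs, and the consistency rule are illustrative assumptions for the sketch, not the paper's actual tool library or verification logic.

```python
# Hypothetical tool registry with stubbed outputs; a real pipeline would
# call actual perception models on the input frame.
TOOLS = {
    "object_detection": lambda image: [
        {"label": "pedestrian", "bbox": [120, 80, 40, 90]},
    ],
    "depth_estimation": lambda image: {"pedestrian": 12.4},  # metres (stub)
}

def generate_tool_augmented_sample(image, question):
    """Simulate tool calls, record their outputs as grounded reasoning
    steps, and keep the sample only if the steps are self-consistent."""
    steps = []
    detections = TOOLS["object_detection"](image)
    steps.append({"tool": "object_detection", "output": detections})
    depths = TOOLS["depth_estimation"](image)
    steps.append({"tool": "depth_estimation", "output": depths})
    # Self-verification: every detected object must have a depth estimate,
    # otherwise the sample is discarded rather than trained on.
    if not all(d["label"] in depths for d in detections):
        return None
    return {"question": question, "reasoning_steps": steps,
            "answer": "yield to the pedestrian ahead"}

sample = generate_tool_augmented_sample(
    "frame_001.png", "What should the ego vehicle do?")
```

Discarding samples whose simulated tool outputs disagree is one simple way to keep hallucinated reasoning chains out of the training set.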
Results
| Metric/Benchmark | Baseline VLM | AgentThink |
|---|---|---|
| Overall Reasoning Score (DriveLMM-o1) | reference | +53.91% |
| Answer Accuracy (DriveLMM-o1) | reference | +33.54% |
| Zero-shot Generalization (multiple benchmarks) | Limited | Strong (qualitative gain) |
| Few-shot Generalization (multiple benchmarks) | Limited | Strong (qualitative gain) |
Key Takeaways
- Integrating tool invocation directly into CoT reasoning pipelines (rather than post-hoc) substantially reduces hallucinations and improves grounding in VLMs for safety-critical applications like autonomous driving
- Combining SFT for behavior cloning of tool-use patterns with GRPO for policy optimization is an effective recipe for teaching LLMs/VLMs structured agentic behavior without requiring massive manually annotated datasets
- Designing task-specific evaluation protocols that measure tool invocation quality—not just final output accuracy—is essential for validating agentic AI systems and should be adopted more broadly in multi-modal agent research
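A tool-invocation metric of the kind the last takeaway advocates could look like the following F1-style sketch. The function `tool_usage_score` and its set-based matching are hypothetical simplifications; the paper's protocol also assesses how tool outputs are utilized within the reasoning chain, which is not modeled here.

```python
def tool_usage_score(predicted_calls, reference_calls):
    """F1 over the sets of tools invoked in a reasoning trace: rewards
    calling the right tools (recall) without spurious calls (precision)."""
    pred, ref = set(predicted_calls), set(reference_calls)
    if not pred and not ref:
        return 1.0  # correctly invoked no tools
    precision = len(pred & ref) / len(pred) if pred else 0.0
    recall = len(pred & ref) / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Model invoked two of the three reference tools with no spurious calls.
score = tool_usage_score(
    ["object_detection", "depth_estimation"],
    ["object_detection", "depth_estimation", "scene_parsing"],
)
```

Reporting such a score alongside final-answer accuracy makes it visible when a model reaches correct answers through ungrounded reasoning.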
Abstract
Vision-Language Models (VLMs) show promise for autonomous driving, yet they struggle with hallucinations, inefficient reasoning, and limited real-world validation, which hinder accurate perception and robust step-by-step reasoning. To overcome this, we introduce **AgentThink**, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: **(i) Structured Data Generation**, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data that explicitly incorporates tool usage across diverse driving scenarios; **(ii) A Two-Stage Training Pipeline**, which employs Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and **(iii) Agent-Style Tool-Usage Evaluation**, which introduces a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink boosts overall reasoning scores by **53.91%** and answer accuracy by **33.54%**, while markedly improving reasoning quality and consistency. Ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks further underscore its capabilities. These findings highlight a promising trajectory for developing trustworthy, tool-aware autonomous driving models. Code is available at https://github.com/curryqka/AgentThink.