Large Language Model Based Autonomous Task Planning for Abstract Commands
Problem Statement
Existing LLM-based task planning for robots typically requires explicit, concrete commands that specify target objects and their locations, limiting usability for everyday users, who tend to communicate abstractly. Abstract commands require robots to autonomously infer task-relevant objects and spatial relationships from their environment, which prior systems do not support. This gap prevents practical deployment of robot assistants in unstructured real-world settings.
Key Novelty
- Two-phase framework separating environment recognition from task planning, enabling robots to handle underspecified, abstract commands
- Use of a large vision-language model (VLM) to generate a hierarchical scene graph that encodes object-space relationships for downstream LLM reasoning
- Demonstration that structured environmental representations (scene graphs) can bridge the gap between abstract user intent and concrete executable robot actions
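To make the "hierarchical scene graph" concrete, here is a minimal sketch of one plausible representation: rooms contain receptacles (spaces), which contain objects with attributes. The schema and helper below are illustrative assumptions; the paper does not publish its exact graph format.

```python
# Illustrative hierarchical scene graph: room -> receptacle -> objects.
# This schema is an assumption, not the paper's exact format.
scene_graph = {
    "Kitchen": {
        "CounterTop": [
            {"name": "Apple", "category": "food"},
            {"name": "Knife", "category": "utensil"},
        ],
        "Fridge": [
            {"name": "Tomato", "category": "food"},
        ],
    },
}

def find_objects(graph, category):
    """Collect (room, receptacle, object) triples matching a category."""
    return [
        (room, receptacle, obj["name"])
        for room, receptacles in graph.items()
        for receptacle, objects in receptacles.items()
        for obj in objects
        if obj["category"] == category
    ]

print(find_objects(scene_graph, "food"))
# → [('Kitchen', 'CounterTop', 'Apple'), ('Kitchen', 'Fridge', 'Tomato')]
```

This kind of category query is what lets a planner resolve an underspecified phrase like "food ingredients" to concrete objects and their locations.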
Evaluation Highlights
- Framework validated in AI2THOR simulation environment, showing superior task execution performance on abstract commands compared to baseline approaches
- Qualitative and quantitative comparisons demonstrate improved success rates in interpreting and executing tasks where target objects/locations are not explicitly named
Methodology
- Environment Recognition Phase: A vision-language model perceives the robot's surroundings and constructs a hierarchical scene graph capturing spatial relationships and object categories across the environment
- Task Planning Phase: An LLM ingests the hierarchical scene graph alongside the abstract natural language command to reason about which objects and locations are relevant and generates a structured task plan
- Execution & Validation: The generated task plan is executed in the AI2THOR simulation environment, and success is measured against ground-truth task completion criteria for abstract command scenarios
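The three phases above can be sketched as a pipeline. The VLM and LLM calls are stubbed with deterministic placeholders (`recognize_environment` and `plan_task` are hypothetical names, and the hand-coded rule for "food ingredients" stands in for LLM reasoning); a real system would replace the stubs with model calls and execute the returned actions in AI2THOR.

```python
def recognize_environment(observations):
    """Phase 1 (stub): a VLM would build a hierarchical scene graph
    from visual observations. Here we return a fixed example graph."""
    return {
        "Kitchen": {
            "CounterTop": [{"name": "Apple", "category": "food"}],
            "Fridge": [{"name": "Tomato", "category": "food"}],
        },
    }

def plan_task(scene_graph, command):
    """Phase 2 (stub): an LLM would infer target objects and locations
    from the graph plus the abstract command. This sketch hand-codes
    one case: gather all food items into a single receptacle."""
    if "food ingredients" in command:
        target = "Fridge"  # illustrative choice of target receptacle
        steps = []
        for room, receptacles in scene_graph.items():
            for receptacle, objects in receptacles.items():
                for obj in objects:
                    if obj["category"] == "food" and receptacle != target:
                        steps.append(("PickupObject", obj["name"]))
                        steps.append(("PutObject", target))
        return steps
    raise ValueError("command not handled by this sketch")

graph = recognize_environment(observations=None)
plan = plan_task(graph, "Put the food ingredients in the same place.")
print(plan)  # [('PickupObject', 'Apple'), ('PutObject', 'Fridge')]
```

The separation mirrors the paper's design: the planner never sees pixels, only the structured graph, so each phase can be swapped independently.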
System Components
- Vision-language model: a large VLM that processes visual observations of the robot's environment and outputs a structured graph representing objects, their attributes, and spatial/semantic relationships to surrounding spaces
- LLM task planner: takes the scene graph and an abstract user command as input and produces a concrete, step-by-step task plan by inferring relevant objects and target locations
- AI2THOR simulator: serves as the validation testbed, providing a realistic indoor 3D environment where robot agents can execute and evaluate the generated task plans
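One way the scene graph can reach the planning LLM is by serializing it into the prompt as indented text. The template below is an assumption for illustration; the paper does not specify its prompt format.

```python
def scene_graph_to_text(graph):
    """Flatten a room -> receptacle -> objects graph into indented
    lines an LLM can read in its prompt (illustrative format)."""
    lines = []
    for room, receptacles in graph.items():
        lines.append(f"{room}:")
        for receptacle, objects in receptacles.items():
            names = ", ".join(obj["name"] for obj in objects)
            lines.append(f"  {receptacle}: {names}")
    return "\n".join(lines)

def build_prompt(graph, command):
    """Assemble a planning prompt (hypothetical template)."""
    return (
        "You are a robot task planner.\n"
        "Environment (scene graph):\n"
        f"{scene_graph_to_text(graph)}\n"
        f"Command: {command}\n"
        "Output a numbered list of executable actions."
    )

graph = {"Kitchen": {"Fridge": [{"name": "Tomato"}]}}
print(build_prompt(graph, "Put the food ingredients in the same place."))
```

Keeping the serialization hierarchical (rooms, then receptacles, then objects) preserves the spatial containment structure the LLM needs to pick sensible target locations.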
Results
| Metric | Baseline (prior LLM planners, explicit commands) | This Paper (abstract-command framework) | Delta |
|---|---|---|---|
| Task success rate on abstract commands | Lower; not designed for abstract input | Superior (specific numbers not reported in the abstract) | Positive improvement |
| Handling of underspecified object targets | Fails or requires user clarification | Autonomous resolution via scene graph | Qualitative improvement |
| Overall task completion in AI2THOR simulation | Standard LLM task planners | Proposed two-phase framework | Higher completion rate |
Key Takeaways
- Hierarchical scene graphs are an effective intermediate representation to ground abstract natural language commands in physical robot environments, enabling LLMs to reason about implicit object/location references
- Separating perception (scene understanding) from planning (LLM reasoning) into distinct phases improves modularity and allows the best available model to be chosen for each sub-task independently
- For practitioners building robotic AI agents, this framework suggests that investing in rich, structured environmental representations fed to LLMs is more effective than expecting LLMs to handle both perception and planning end-to-end
Abstract
Recent advances in large language models (LLMs) have demonstrated exceptional reasoning capabilities in natural language processing, sparking interest in applying LLMs to task planning problems in robotics. Most studies focused on task planning for clear natural language commands that specify target objects and their locations. However, for more user-friendly task execution, it is crucial for robots to autonomously plan and carry out tasks based on abstract natural language commands that may not explicitly mention target objects or locations, such as ‘Put the food ingredients in the same place.’ In this study, we propose an LLM-based autonomous task planning framework that generates task plans for abstract natural language commands. This framework consists of two phases: an environment recognition phase and a task planning phase. In the environment recognition phase, a large vision-language model generates a hierarchical scene graph that captures the relationships between objects and spaces in the environment surrounding a robot agent. During the task planning phase, an LLM uses the scene graph and the abstract user command to formulate a plan for the given task. We validate the effectiveness of the proposed framework in the AI2THOR simulation environment, demonstrating its superior performance in task execution when handling abstract commands.