Large Language Model Based Autonomous Task Planning for Abstract Commands
Problem Statement
Existing LLM-based task planning for robots typically requires explicit, concrete commands that specify target objects and their locations, limiting usability for everyday users, who tend to communicate abstractly. Abstract commands require robots to autonomously infer task-relevant objects and spatial relationships from their environment, which prior systems do not support. This gap prevents practical deployment of robot assistants in unstructured real-world settings.
Key Novelty
- Two-phase framework separating environment recognition from task planning, enabling robots to handle underspecified, abstract commands
- Use of a large vision-language model (VLM) to generate a hierarchical scene graph that encodes object-space relationships for downstream LLM reasoning
- Demonstration that structured environmental representations (scene graphs) can bridge the gap between abstract user intent and concrete executable robot actions
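To make the "hierarchical scene graph" concrete, here is a minimal sketch of one plausible representation: rooms contain receptacles (spaces), which contain objects with attributes. The schema and helper below are illustrative assumptions; the paper does not publish its exact graph format.

```python
# Illustrative hierarchical scene graph: room -> receptacle -> objects.
# This schema is an assumption, not the paper's exact format.
scene_graph = {
    "Kitchen": {
        "CounterTop": [
            {"name": "Apple", "category": "food"},
            {"name": "Knife", "category": "utensil"},
        ],
        "Fridge": [
            {"name": "Tomato", "category": "food"},
        ],
    },
}

def find_objects(graph, category):
    """Collect (room, receptacle, object) triples matching a category."""
    return [
        (room, receptacle, obj["name"])
        for room, receptacles in graph.items()
        for receptacle, objects in receptacles.items()
        for obj in objects
        if obj["category"] == category
    ]

print(find_objects(scene_graph, "food"))
# → [('Kitchen', 'CounterTop', 'Apple'), ('Kitchen', 'Fridge', 'Tomato')]
```

This kind of category query is what lets a planner resolve an underspecified phrase like "food ingredients" to concrete objects and their locations.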
Evaluation Highlights
- Framework validated in AI2THOR simulation environment, showing superior task execution performance on abstract commands compared to baseline approaches
- Qualitative and quantitative comparisons demonstrate improved success rates in interpreting and executing tasks where target objects/locations are not explicitly named
Methodology
- Environment Recognition Phase: A vision-language model perceives the robot's surroundings and constructs a hierarchical scene graph capturing spatial relationships and object categories across the environment
- Task Planning Phase: An LLM ingests the hierarchical scene graph alongside the abstract natural language command to reason about which objects and locations are relevant and generates a structured task plan
- Execution & Validation: The generated task plan is executed in the AI2THOR simulation environment, and success is measured against ground-truth task completion criteria for abstract command scenarios
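The three phases above can be sketched as a pipeline. The VLM and LLM calls are stubbed with deterministic placeholders (`recognize_environment` and `plan_task` are hypothetical names, and the hand-coded rule for "food ingredients" stands in for LLM reasoning); a real system would replace the stubs with model calls and execute the returned actions in AI2THOR.

```python
def recognize_environment(observations):
    """Phase 1 (stub): a VLM would build a hierarchical scene graph
    from visual observations. Here we return a fixed example graph."""
    return {
        "Kitchen": {
            "CounterTop": [{"name": "Apple", "category": "food"}],
            "Fridge": [{"name": "Tomato", "category": "food"}],
        },
    }

def plan_task(scene_graph, command):
    """Phase 2 (stub): an LLM would infer target objects and locations
    from the graph plus the abstract command. This sketch hand-codes
    one case: gather all food items into a single receptacle."""
    if "food ingredients" in command:
        target = "Fridge"  # illustrative choice of target receptacle
        steps = []
        for room, receptacles in scene_graph.items():
            for receptacle, objects in receptacles.items():
                for obj in objects:
                    if obj["category"] == "food" and receptacle != target:
                        steps.append(("PickupObject", obj["name"]))
                        steps.append(("PutObject", target))
        return steps
    raise ValueError("command not handled by this sketch")

graph = recognize_environment(observations=None)
plan = plan_task(graph, "Put the food ingredients in the same place.")
print(plan)  # [('PickupObject', 'Apple'), ('PutObject', 'Fridge')]
```

The separation mirrors the paper's design: the planner never sees pixels, only the structured graph, so each phase can be swapped independently.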
System Components
- Vision-language model: a large VLM that processes visual observations of the robot's environment and outputs a structured graph representing objects, their attributes, and spatial/semantic relationships to surrounding spaces
- LLM task planner: takes the scene graph and an abstract user command as input and produces a concrete, step-by-step task plan by inferring relevant objects and target locations
- AI2THOR simulator: serves as the validation testbed, providing a realistic indoor 3D environment where robot agents can execute and evaluate the generated task plans
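One way the scene graph can reach the planning LLM is by serializing it into the prompt as indented text. The template below is an assumption for illustration; the paper does not specify its prompt format.

```python
def scene_graph_to_text(graph):
    """Flatten a room -> receptacle -> objects graph into indented
    lines an LLM can read in its prompt (illustrative format)."""
    lines = []
    for room, receptacles in graph.items():
        lines.append(f"{room}:")
        for receptacle, objects in receptacles.items():
            names = ", ".join(obj["name"] for obj in objects)
            lines.append(f"  {receptacle}: {names}")
    return "\n".join(lines)

def build_prompt(graph, command):
    """Assemble a planning prompt (hypothetical template)."""
    return (
        "You are a robot task planner.\n"
        "Environment (scene graph):\n"
        f"{scene_graph_to_text(graph)}\n"
        f"Command: {command}\n"
        "Output a numbered list of executable actions."
    )

graph = {"Kitchen": {"Fridge": [{"name": "Tomato"}]}}
print(build_prompt(graph, "Put the food ingredients in the same place."))
```

Keeping the serialization hierarchical (rooms, then receptacles, then objects) preserves the spatial containment structure the LLM needs to pick sensible target locations.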
Results
| Metric | Baseline (prior LLM planners, explicit commands) | This Paper (abstract-command framework) | Delta |
|---|---|---|---|
| Task success rate on abstract commands | Lower; not designed for abstract input | Superior (specific numbers not reported in the abstract) | Positive improvement |
| Handling of underspecified object targets | Fails or requires user clarification | Autonomous resolution via scene graph | Qualitative improvement |
| Overall task completion in AI2THOR simulation | Standard LLM task planners | Proposed two-phase framework | Higher completion rate |
Key Takeaways
- Hierarchical scene graphs are an effective intermediate representation to ground abstract natural language commands in physical robot environments, enabling LLMs to reason about implicit object/location references
- Separating perception (scene understanding) from planning (LLM reasoning) into distinct phases improves modularity and allows the best available model to be chosen for each sub-task independently
- For practitioners building robotic AI agents, this framework suggests that investing in rich, structured environmental representations fed to LLMs is more effective than expecting LLMs to handle both perception and planning end-to-end
Abstract
Recent advances in large language models (LLMs) have demonstrated exceptional reasoning capabilities in natural language processing, sparking interest in applying LLMs to task planning problems in robotics. Most studies focused on task planning for clear natural language commands that specify target objects and their locations. However, for more user-friendly task execution, it is crucial for robots to autonomously plan and carry out tasks based on abstract natural language commands that may not explicitly mention target objects or locations, such as ‘Put the food ingredients in the same place.’ In this study, we propose an LLM-based autonomous task planning framework that generates task plans for abstract natural language commands. This framework consists of two phases: an environment recognition phase and a task planning phase. In the environment recognition phase, a large vision-language model generates a hierarchical scene graph that captures the relationships between objects and spaces in the environment surrounding a robot agent. During the task planning phase, an LLM uses the scene graph and the abstract user command to formulate a plan for the given task. We validate the effectiveness of the proposed framework in the AI2THOR simulation environment, demonstrating its superior performance in task execution when handling abstract commands.