From Prompt to Action: A Comprehensive Review of LLM Autonomous Agents
Problem Statement
As LLM-based agents proliferate across domains, from digital assistants to autonomous robots, practitioners lack a unified framework for understanding their construction, benchmarking their reliability, and identifying their failure modes. The existing literature gives little focused treatment to deployment challenges in hostile, resource-constrained, or safety-critical environments such as space systems, and there is no standardized methodology for measuring the dependability and robustness of LLM agents under adversarial conditions.
Key Novelty
- Structured taxonomy of single-agent and multi-agent LLM architectures organized around core functional modules: perception, reasoning, planning, and action
- Focused analysis of LLM agent challenges specific to safety-critical and adversarial domains (e.g., space, wireless, remote environments), going beyond general-purpose benchmarks
- Proposal of future evaluation standards and metrics for measuring dependability, robustness, and safety of LLM-based autonomous agents in hostile environments
Evaluation Highlights
- Qualitative comparison of agent architectures across dimensions: zero-shot generalization, dynamic tool use, human-AI collaboration, and multi-agent coordination
- Critical assessment of failure modes including hallucination rates, resource constraints, and safety vulnerabilities across surveyed systems
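The failure-mode assessment above can be made concrete with simple dependability metrics over labeled agent trials. A minimal sketch (the `TrialResult` fields and metric names are illustrative assumptions, not definitions from the survey):

```python
from dataclasses import dataclass

@dataclass
class TrialResult:
    """Outcome of one agent trial under a given condition."""
    grounded: bool      # answer supported by the provided context (not hallucinated)
    completed: bool     # task finished within the resource budget
    adversarial: bool   # trial used a perturbed/adversarial input

def dependability_report(trials):
    """Aggregate a hallucination rate and an adversarial success rate."""
    total = len(trials)
    hallucination_rate = sum(1 for t in trials if not t.grounded) / total
    adv = [t for t in trials if t.adversarial]
    # Fraction of adversarial trials the agent still completed; None if untested.
    adv_success = (sum(1 for t in adv if t.completed) / len(adv)) if adv else None
    return {"hallucination_rate": hallucination_rate,
            "adversarial_success": adv_success}

trials = [
    TrialResult(grounded=True,  completed=True,  adversarial=False),
    TrialResult(grounded=False, completed=True,  adversarial=True),
    TrialResult(grounded=True,  completed=True,  adversarial=True),
    TrialResult(grounded=True,  completed=False, adversarial=True),
]
report = dependability_report(trials)
```

Per-condition breakdowns (nominal vs. adversarial) follow the same pattern and are what a domain-specific evaluation suite would report.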
Methodology
- Systematic literature review of recent LLM-based autonomous agent architectures, categorizing them by design pattern (single-agent vs. multi-agent) and functional module (perception, reasoning, planning, action)
- Critical analysis of emergent LLM capabilities (zero-shot generalization, tool use, human-AI collaboration) and their real-world limitations (hallucination, compute constraints, safety risks)
- Synthesis of open research challenges and proposed evaluation criteria tailored to hostile, wireless, and resource-constrained deployment scenarios
System Components
- Perception: handles multimodal input processing, enabling agents to interpret text, sensor data, and environmental signals
- Reasoning: leverages LLM capabilities for chain-of-thought inference, decision-making under uncertainty, and contextual understanding
- Planning: translates high-level goals into structured action sequences, including task decomposition and dynamic replanning
- Action: executes decisions via tool use, API calls, robotic control, or other environment-specific effectors
- Multi-agent coordination: manages communication, role assignment, and collaborative task execution across multiple LLM-driven agents
- Evaluation: proposed metrics and benchmarks for assessing agent reliability, hallucination resistance, and adversarial robustness
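The perception–reasoning–planning–action pipeline can be sketched as a modular loop. This is a toy illustration, not an implementation from the survey; in a real agent the reasoning and planning callables would wrap LLM calls:

```python
from typing import Any

class Agent:
    """Minimal perception -> reasoning -> planning -> action loop.

    Each module is injected as a plain callable so it can be swapped
    independently (e.g., replacing the planner without touching perception).
    """
    def __init__(self, perceive, reason, plan, act):
        self.perceive, self.reason = perceive, reason
        self.plan, self.act = plan, act

    def step(self, raw_input: Any):
        observation = self.perceive(raw_input)   # multimodal input -> features
        belief = self.reason(observation)        # contextual interpretation
        actions = self.plan(belief)              # goal -> action sequence
        return [self.act(a) for a in actions]    # execute via effectors

# Toy instantiation: tokenize a status string and "act" by uppercasing.
agent = Agent(
    perceive=lambda x: x.strip(),
    reason=lambda obs: {"tokens": obs.split()},
    plan=lambda belief: belief["tokens"],
    act=lambda tok: tok.upper(),
)
result = agent.step("  status nominal ")
```

Multi-agent coordination would sit above this loop, routing messages between several such agents.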
Results
| Dimension | Traditional Agents | LLM-Based Agents | Delta |
|---|---|---|---|
| Zero-shot task generalization | Limited, requires retraining | Strong via in-context learning | Significant qualitative improvement |
| Tool/API chaining | Hardcoded pipelines | Dynamic, prompt-driven | More flexible but less predictable |
| Hallucination risk | N/A (rule-based) | Present, domain-dependent | New failure mode introduced |
| Multi-agent coordination | Scripted protocols | Emergent via language | More adaptive, less reliable |
| Safety in adversarial envs | Deterministic guarantees | Probabilistic, context-sensitive | Gap remains unresolved |
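The tool/API-chaining row can be illustrated concretely: traditional agents hardcode a fixed pipeline, whereas LLM-based agents select tools from a registry at run time based on a prompt-derived plan. A minimal sketch of such a registry (the tool names are hypothetical, and the chain is hardcoded here where a real system would obtain it from the LLM's plan):

```python
TOOLS = {}

def tool(name):
    """Register a callable under a name the planner can select."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@tool("convert_units")
def convert_units(km: float) -> float:
    return km * 0.621371  # km -> miles

@tool("round2")
def round2(x: float) -> float:
    return round(x, 2)

def run_chain(chain, value):
    """Apply a chain of tool names in order.

    An unknown name raises KeyError -- exactly the 'less predictable'
    failure mode the table notes, since the planner may emit a tool
    that was never registered.
    """
    for name in chain:
        value = TOOLS[name](value)
    return value

miles = run_chain(["convert_units", "round2"], 10.0)
```
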
Key Takeaways
- LLM agents offer strong generalization and tool-use flexibility, but practitioners deploying them in safety-critical systems must explicitly account for hallucination and non-determinism through validation layers and human-in-the-loop mechanisms
- Multi-agent LLM architectures enable scalable task decomposition but introduce new coordination failure modes; designing robust communication protocols and fallback strategies is essential for production systems
- Current benchmarks inadequately measure robustness and dependability in hostile or resource-constrained environments—researchers should prioritize developing domain-specific evaluation suites before deploying LLM agents in space, defense, or industrial automation contexts
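The first takeaway's validation layer and human-in-the-loop mechanism can be sketched as a guard between the LLM's proposed action and the effector. All names here (`guarded_execute`, the allowlist, the `confirm` hook) are illustrative assumptions rather than an interface from any surveyed system:

```python
def guarded_execute(proposed_action: dict, allowed_actions: set,
                    confirm=lambda action: True):
    """Validation layer between an LLM's proposed action and the effector.

    Rejects any action outside an explicit allowlist (catching hallucinated
    tool names), then routes the remainder through a human-in-the-loop
    confirmation hook before approving execution.
    """
    if proposed_action.get("name") not in allowed_actions:
        return {"status": "rejected", "reason": "action not in allowlist"}
    if not confirm(proposed_action):
        return {"status": "deferred", "reason": "operator declined"}
    return {"status": "approved", "action": proposed_action["name"]}

ALLOWED = {"read_sensor", "send_telemetry"}

# A hallucinated tool name is caught before reaching any effector:
bad = guarded_execute({"name": "fire_thruster_9000"}, ALLOWED)
ok = guarded_execute({"name": "read_sensor"}, ALLOWED)
```

In a safety-critical deployment the `confirm` hook would block on an operator console rather than default to approval, and the allowlist would be derived from the mission's certified action set.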
Abstract
Large Language Models (LLMs) have rapidly pushed the frontiers of autonomous agents, enabling advanced reasoning, natural language interaction, and tool chaining in complex environments. As LLM-based agents emerge in domains such as digital assistants, autonomous robots, and mission planning, a deep understanding of their construction, strengths, and weaknesses is more critical than ever, especially in safety-critical and adversarial domains like space systems. This article surveys the most recent developments in autonomous agents built with LLMs. We categorize modern architectures into single-agent and multi-agent designs and describe their core functional modules: perception, reasoning, planning, and action. We present new capabilities enabled by LLMs, including zero-shot generalization, dynamic tool use, and human-AI collaboration, and critique their limitations in real-world use, such as hallucination, constrained resources, and safety risks. We further discuss future standards and metrics for LLM agents, including how to measure dependability and robustness in hostile environments. Finally, we present open research challenges, highlighting the need for stable, efficient, and robust LLM-based agents deployable in wireless, remote, and hostile environments. This survey aims to offer researchers and practitioners a concise overview of the state of the art in LLM-based autonomous agents and to inspire future work bridging the gap between general-purpose language intelligence and domain-specific autonomous systems.