AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks
Problem Statement
Existing speech assistants either lack true agentic reasoning with integrated tool use or are closed-source proprietary systems, creating a gap for researchers and developers needing open, reproducible voice AI. No open-source system previously supported full speech-to-speech interaction with multi-turn context, calendar booking, web search, email, and contact lookup in a unified framework. This limits reproducibility, customizability, and community-driven advancement in voice-based AI agents.
Key Novelty
- First open-source, speech-native assistant supporting multi-turn dialogue with dynamic tool invocation across real-world tasks (calendar, email, web search, contact lookup)
- Modular cascaded pipeline architecture combining open-weight ASR, TTS, and LLMs, enabling easy plug-in of new tools via natural language prompts and action classes (see the sketch after this list)
- Comprehensive evaluation on VoiceBench (OpenBookQA, AlpacaEval) plus human evaluation on complex multi-turn speech tasks, establishing open-source baselines for voice agents
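The action-class mechanism in the second bullet is not spelled out in this summary, so the following is a minimal Python sketch of how such a plug-in interface could look; `Action`, `ToolRegistry`, and `CalendarBooking` are hypothetical names for illustration, not AURA's actual API.

```python
# Minimal sketch of the prompt-plus-action-class plug-in pattern. All names
# here (Action, ToolRegistry, CalendarBooking) are hypothetical illustrations,
# not AURA's actual interface.
from dataclasses import dataclass


@dataclass
class Action:
    """A tool the LLM can invoke, described to it in natural language."""
    name: str
    description: str  # natural-language prompt shown to the LLM
    parameters: dict  # argument name -> short description

    def run(self, **kwargs) -> str:
        raise NotImplementedError


class CalendarBooking(Action):
    def __init__(self):
        super().__init__(
            name="calendar_booking",
            description="Book a calendar event given a title, date, and time.",
            parameters={"title": "event title", "date": "YYYY-MM-DD",
                        "time": "HH:MM, 24-hour"},
        )

    def run(self, title: str, date: str, time: str) -> str:
        # A real implementation would call a calendar API here.
        return f"Booked '{title}' on {date} at {time}."


class ToolRegistry:
    """Collects actions and renders their descriptions into the LLM prompt."""
    def __init__(self):
        self.actions: dict[str, Action] = {}

    def register(self, action: Action) -> None:
        self.actions[action.name] = action

    def as_prompt(self) -> str:
        return "\n".join(
            f"- {a.name}: {a.description} (args: {a.parameters})"
            for a in self.actions.values()
        )
```

Under this assumed interface, adding a new tool amounts to `registry.register(CalendarBooking())` plus embedding `registry.as_prompt()` in the LLM's system message.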
Evaluation Highlights
- Scored 92.75% on VoiceBench OpenBookQA, outperforming all open-weight systems and approaching GPT-4o performance
- Achieved 4.39 on AlpacaEval (competitive with other open-weight systems) and 90% task success rate in human evaluation on complex multi-turn speech tasks
Methodology
- Incoming speech is transcribed by an open-weight ASR model; the transcript is appended to the conversation history, so context is preserved across turns
- The transcribed text is passed to an open-weight LLM configured with tool-use capabilities; the LLM reasons over conversation history and selects appropriate tools (calendar, email, web search, contact lookup) via natural language prompts and action class definitions
- Tool outputs are returned to the LLM to synthesize a final response, which is then converted to speech via an open-weight TTS model and delivered back to the user, completing the speech-to-speech loop (sketched after this list)
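To make the turn flow concrete, here is a hedged Python sketch of a single pass through the cascade, reusing the hypothetical `ToolRegistry` from the earlier sketch; the component methods and the decision-dict format are assumptions for exposition, not details from the paper.

```python
# Hedged sketch of one speech-to-speech turn. asr.transcribe, llm.generate,
# and tts.synthesize are stand-ins for the open-weight components; the
# decision dict format ("tool", "args", "text") is an assumption, not the
# paper's protocol.
def run_turn(audio_in, history, asr, llm, tts, registry):
    """One ASR -> LLM (+ tools) -> TTS turn; history carries prior turns."""
    user_text = asr.transcribe(audio_in)
    history.append({"role": "user", "content": user_text})

    # Ask the LLM for either a tool call or a direct textual reply.
    decision = llm.generate(system=registry.as_prompt(), messages=history)
    if decision.get("tool"):
        action = registry.actions[decision["tool"]]
        observation = action.run(**decision.get("args", {}))
        history.append({"role": "tool", "content": observation})
        # Second pass: let the LLM fold the tool output into a final answer.
        decision = llm.generate(system=registry.as_prompt(), messages=history)

    reply = decision["text"]
    history.append({"role": "assistant", "content": reply})
    return tts.synthesize(reply)  # audio delivered back to the user
```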
System Components
- ASR: Open-weight automatic speech recognition component that converts user voice input to text for downstream processing
- LLM agent: Open-weight large language model that maintains multi-turn dialogue context, performs agentic reasoning, and decides which tools to invoke based on user intent
- Tool layer: Modular system supporting calendar booking, contact lookup, web search, and email; new tools can be added using natural language prompts and action class definitions
- TTS: Open-weight text-to-speech component that converts the LLM's final text response back into spoken audio for the user
- Orchestrator: Coordinates the sequential flow of ASR → LLM (+ tools) → TTS while maintaining session state across multi-turn conversations (see the sketch below)
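A minimal sketch of the orchestrator role, built on the `run_turn` helper above; the class and method names are assumptions, shown only to illustrate where multi-turn session state would live.

```python
# Illustrative orchestrator built on the run_turn sketch above; the class
# and method names are assumptions for exposition, not AURA's API.
class Orchestrator:
    """Drives the ASR -> LLM (+ tools) -> TTS cascade across a session."""

    def __init__(self, asr, llm, tts, registry):
        self.asr, self.llm, self.tts, self.registry = asr, llm, tts, registry
        self.history = []  # session state shared by all turns

    def handle(self, audio_in):
        # Each call handles one user utterance; history persists between calls.
        return run_turn(audio_in, self.history,
                        self.asr, self.llm, self.tts, self.registry)

    def reset(self) -> None:
        self.history.clear()
```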
Results
| Metric/Benchmark | Comparison System(s) | AURA (This Paper) | Delta |
|---|---|---|---|
| VoiceBench OpenBookQA (%) | All open-weight systems (each below 92.75) | 92.75 | Surpasses all open-weight systems |
| VoiceBench AlpacaEval (score) | Other open-weight systems (~4.x) | 4.39 | Competitive with peers |
| Human eval task success (%) | Not reported for prior open systems | 90 | Strong absolute performance |
| VoiceBench OpenBookQA vs. GPT-4o (%) | GPT-4o (proprietary) | 92.75 | Near parity with GPT-4o |
Key Takeaways
- Open-weight ASR+LLM+TTS cascades can nearly match GPT-4o on structured voice QA benchmarks, making open-source voice agents viable for research and deployment without proprietary APIs
- Modular tool integration via natural language prompts lowers the barrier for extending voice agents with new capabilities, enabling rapid prototyping of domain-specific assistants
- Human evaluation with multi-turn, real-world task success metrics (90%) is essential for validating voice agents beyond standard NLP benchmarks, highlighting the need for task-oriented evaluation protocols in the community
Abstract
Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA, outperforming all open-weight systems and nearing GPT-4o, and reaches 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.