AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks
Problem Statement
Existing speech assistants either lack true agentic reasoning with integrated tool use or are closed-source proprietary systems, creating a gap for researchers and developers needing open, reproducible voice AI. No open-source system previously supported full speech-to-speech interaction with multi-turn context, calendar booking, web search, email, and contact lookup in a unified framework. This limits reproducibility, customizability, and community-driven advancement in voice-based AI agents.
Key Novelty
- First open-source, speech-native assistant supporting multi-turn dialogue with dynamic tool invocation across real-world tasks (calendar, email, web search, contact lookup)
- Modular cascaded pipeline architecture combining open-weight ASR, TTS, and LLMs, enabling easy plug-in of new tools via natural language prompts and action classes (see the sketch after this list)
- Comprehensive evaluation on VoiceBench (OpenBookQA, AlpacaEval) plus human evaluation on complex multi-turn speech tasks, establishing open-source baselines for voice agents
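The action-class mechanism in the second bullet is not spelled out in this summary, so the following is a minimal Python sketch of how such a plug-in interface could look; `Action`, `ToolRegistry`, and `CalendarBooking` are hypothetical names for illustration, not AURA's actual API.

```python
# Minimal sketch of the prompt-plus-action-class plug-in pattern. All names
# here (Action, ToolRegistry, CalendarBooking) are hypothetical illustrations,
# not AURA's actual interface.
from dataclasses import dataclass


@dataclass
class Action:
    """A tool the LLM can invoke, described to it in natural language."""
    name: str
    description: str  # natural-language prompt shown to the LLM
    parameters: dict  # argument name -> short description

    def run(self, **kwargs) -> str:
        raise NotImplementedError


class CalendarBooking(Action):
    def __init__(self):
        super().__init__(
            name="calendar_booking",
            description="Book a calendar event given a title, date, and time.",
            parameters={"title": "event title", "date": "YYYY-MM-DD",
                        "time": "HH:MM, 24-hour"},
        )

    def run(self, title: str, date: str, time: str) -> str:
        # A real implementation would call a calendar API here.
        return f"Booked '{title}' on {date} at {time}."


class ToolRegistry:
    """Collects actions and renders their descriptions into the LLM prompt."""
    def __init__(self):
        self.actions: dict[str, Action] = {}

    def register(self, action: Action) -> None:
        self.actions[action.name] = action

    def as_prompt(self) -> str:
        return "\n".join(
            f"- {a.name}: {a.description} (args: {a.parameters})"
            for a in self.actions.values()
        )
```

Under this assumed interface, adding a new tool amounts to `registry.register(CalendarBooking())` plus embedding `registry.as_prompt()` in the LLM's system message.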
Evaluation Highlights
- Scored 92.75% on VoiceBench OpenBookQA, outperforming all open-weight systems and approaching GPT-4o performance
- Achieved 4.39 on AlpacaEval (competitive with other open-weight systems) and 90% task success rate in human evaluation on complex multi-turn speech tasks
Methodology
- Incoming speech is transcribed by an open-weight ASR model; the transcript is appended to the conversation history, so context is preserved across turns
- The transcribed text is passed to an open-weight LLM configured with tool-use capabilities; the LLM reasons over conversation history and selects appropriate tools (calendar, email, web search, contact lookup) via natural language prompts and action class definitions
- Tool outputs are returned to the LLM to synthesize a final response, which is then converted to speech via an open-weight TTS model and delivered back to the user, completing the speech-to-speech loop (sketched after this list)
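To make the turn flow concrete, here is a hedged Python sketch of a single pass through the cascade, reusing the hypothetical `ToolRegistry` from the earlier sketch; the component methods and the decision-dict format are assumptions for exposition, not details from the paper.

```python
# Hedged sketch of one speech-to-speech turn. asr.transcribe, llm.generate,
# and tts.synthesize are stand-ins for the open-weight components; the
# decision dict format ("tool", "args", "text") is an assumption, not the
# paper's protocol.
def run_turn(audio_in, history, asr, llm, tts, registry):
    """One ASR -> LLM (+ tools) -> TTS turn; history carries prior turns."""
    user_text = asr.transcribe(audio_in)
    history.append({"role": "user", "content": user_text})

    # Ask the LLM for either a tool call or a direct textual reply.
    decision = llm.generate(system=registry.as_prompt(), messages=history)
    if decision.get("tool"):
        action = registry.actions[decision["tool"]]
        observation = action.run(**decision.get("args", {}))
        history.append({"role": "tool", "content": observation})
        # Second pass: let the LLM fold the tool output into a final answer.
        decision = llm.generate(system=registry.as_prompt(), messages=history)

    reply = decision["text"]
    history.append({"role": "assistant", "content": reply})
    return tts.synthesize(reply)  # audio delivered back to the user
```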
System Components
- ASR: Open-weight automatic speech recognition component that converts user voice input to text for downstream processing
- LLM agent: Open-weight large language model that maintains multi-turn dialogue context, performs agentic reasoning, and decides which tools to invoke based on user intent
- Tool layer: Modular system supporting calendar booking, contact lookup, web search, and email; new tools can be added using natural language prompts and action class definitions
- TTS: Open-weight text-to-speech component that converts the LLM's final text response back into spoken audio for the user
- Orchestrator: Coordinates the sequential flow of ASR → LLM (+ tools) → TTS while maintaining session state across multi-turn conversations (see the sketch below)
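A minimal sketch of the orchestrator role, built on the `run_turn` helper above; the class and method names are assumptions, shown only to illustrate where multi-turn session state would live.

```python
# Illustrative orchestrator built on the run_turn sketch above; the class
# and method names are assumptions for exposition, not AURA's API.
class Orchestrator:
    """Drives the ASR -> LLM (+ tools) -> TTS cascade across a session."""

    def __init__(self, asr, llm, tts, registry):
        self.asr, self.llm, self.tts, self.registry = asr, llm, tts, registry
        self.history = []  # session state shared by all turns

    def handle(self, audio_in):
        # Each call handles one user utterance; history persists between calls.
        return run_turn(audio_in, self.history,
                        self.asr, self.llm, self.tts, self.registry)

    def reset(self) -> None:
        self.history.clear()
```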
Results
| Metric/Benchmark | Comparison System(s) | AURA (This Paper) | Delta |
|---|---|---|---|
| VoiceBench OpenBookQA (%) | All open-weight systems (each below 92.75) | 92.75 | Surpasses all open-weight systems |
| VoiceBench AlpacaEval (score) | Other open-weight systems (~4.x) | 4.39 | Competitive with peers |
| Human eval task success (%) | Not reported for prior open systems | 90 | Strong absolute performance |
| VoiceBench OpenBookQA vs. GPT-4o (%) | GPT-4o (proprietary) | 92.75 | Near parity with GPT-4o |
Key Takeaways
- Open-weight ASR+LLM+TTS cascades can nearly match GPT-4o on structured voice QA benchmarks, making open-source voice agents viable for research and deployment without proprietary APIs
- Modular tool integration via natural language prompts lowers the barrier for extending voice agents with new capabilities, enabling rapid prototyping of domain-specific assistants
- Human evaluation with multi-turn, real-world task success metrics (90%) is essential for validating voice agents beyond standard NLP benchmarks, highlighting the need for task-oriented evaluation protocols in the community
Abstract
Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA, outperforming all open-weight systems and nearing GPT-4o, and reaches 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.