RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Problem Statement
RAG has become a critical technique for reliable LLM deployment, but efficient serving remains unsolved due to the proliferation of diverse RAG variants with vastly different computational profiles. Existing LLM serving systems treat RAG as an extension rather than a first-class workload, leading to suboptimal resource utilization and performance. The lack of a unified abstraction makes it difficult to reason about and optimize across the broad landscape of RAG algorithms.
Key Novelty
- RAGSchema: A structured abstraction that formally captures the diverse space of RAG algorithms, enabling systematic performance analysis and optimization across variants
- Empirical workload characterization revealing significant performance variability across representative RAG patterns, exposing gaps in current serving infrastructure
- RAGO: A system optimization framework that leverages RAGSchema to apply targeted resource allocation and scheduling strategies tailored to specific RAG workload characteristics
Evaluation Highlights
- Up to 2× increase in queries per second (QPS) per chip compared to RAG systems built as extensions of standard LLM serving systems
- Up to 55% reduction in time-to-first-token (TTFT) latency over RAG baselines built as extensions of standard LLM serving systems
Methodology
- Define RAGSchema as a structured abstraction encoding key dimensions of RAG algorithms (e.g., retrieval frequency, granularity, fusion strategy, pipeline topology) to unify the diverse RAG design space
- Profile and characterize representative RAG workloads using RAGSchema, measuring compute/memory/latency bottlenecks to expose workload-specific performance variability
- Design RAGO optimizer that uses RAGSchema metadata to apply workload-aware system optimizations including resource partitioning, operator scheduling, and batching strategies for end-to-end RAG pipelines
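The RAGSchema abstraction described above can be sketched as a small data type. This is a minimal illustration of the idea, not the paper's actual API: the field names, enum values, and `is_iterative` helper are assumptions chosen to mirror the dimensions listed in the methodology (retrieval frequency, granularity, fusion strategy, pipeline topology).

```python
from dataclasses import dataclass
from enum import Enum

class FusionStrategy(Enum):
    """How retrieved context is combined with generation (illustrative values)."""
    PREPEND = "prepend"        # retrieved passages prepended to the prompt
    INTERLEAVE = "interleave"  # retrieval interleaved with decoding steps

@dataclass(frozen=True)
class RAGSchema:
    """Toy encoding of a RAG pipeline's key algorithmic dimensions."""
    retrieval_frequency: int    # retrievals per query (1 = one-shot)
    retrieval_granularity: int  # tokens per retrieved chunk
    passages_per_retrieval: int # top-k documents fetched per retrieval
    fusion: FusionStrategy      # fusion strategy for retrieved context
    stages: tuple               # ordered pipeline topology, e.g. ("retrieve", "prefill", "decode")

    def is_iterative(self) -> bool:
        # Iterative-retrieval pipelines retrieve more than once per query,
        # which shifts the workload balance toward the retrieval stage.
        return self.retrieval_frequency > 1
```

A one-shot QA pipeline and an iterative multi-hop pipeline would then differ only in their schema values, which is what lets an optimizer treat them as points in one configuration space rather than as separate systems.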
System Components
- RAGSchema abstraction layer: formally describes the algorithmic configuration of a RAG system, capturing retrieval patterns, generation coupling, and pipeline structure to serve as input to the optimizer
- Profiling and analysis module: benchmarks representative RAG workloads under RAGSchema, identifying performance bottlenecks and variability across retrieval and generation stages
- RAGO optimization framework: maps RAGSchema descriptions to hardware-aware serving strategies, optimizing resource allocation, request scheduling, and pipeline execution for diverse RAG workloads
- Execution backend: implements RAGO's optimization decisions, coordinating retrieval (vector search) and generation (LLM inference) components efficiently on target hardware
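To make the schema-to-strategy mapping concrete, here is a toy planner in the spirit of the optimization framework described above. The heuristics (chip split, prefill/decode placement, batch size) and all thresholds are invented for illustration and are not the paper's actual policies; it only shows the shape of a function from a RAGSchema-style description to a serving plan.

```python
def plan_serving(schema: dict, total_chips: int = 8) -> dict:
    """Map a RAGSchema-style description to a serving plan (toy heuristics)."""
    # Iterative-retrieval pipelines spend proportionally more time in vector
    # search, so shift a larger share of chips toward the retrieval stage.
    iterative = schema["retrieval_frequency"] > 1
    retrieval_share = 0.5 if iterative else 0.25
    retrieval_chips = max(1, round(total_chips * retrieval_share))
    generation_chips = total_chips - retrieval_chips

    # Collocate prefill and decode only when the generation side is tiny;
    # otherwise disaggregate so decode batching is not stalled behind
    # long-running prefill of freshly retrieved context.
    placement = "collocated" if generation_chips <= 2 else "disaggregated"

    # Frequent retrieval makes requests latency-sensitive between hops,
    # so cap the decode batch size lower than for one-shot pipelines.
    batch_size = 8 if iterative else 32

    return {
        "retrieval_chips": retrieval_chips,
        "generation_chips": generation_chips,
        "prefill_decode_placement": placement,
        "decode_batch_size": batch_size,
    }
```

The point of the sketch is that two workloads with different schemas (say, one-shot QA versus multi-hop retrieval) receive different resource splits and batching policies from the same planner, rather than a single static serving configuration.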
Results
| Metric | Baseline (LLM-extension RAG) | RAGO | Delta |
|---|---|---|---|
| QPS per chip (throughput) | 1× | Up to 2× | up to +100% |
| Time-to-first-token (TTFT) latency | 1× | 0.45× | up to −55% |
Key Takeaways
- Treating RAG as a first-class serving workload rather than an LLM extension yields substantial gains—ML engineers building production RAG systems should invest in RAG-specific serving infrastructure rather than adapting generic LLM servers
- The diversity of RAG variants (different retrieval frequencies, fusion strategies, pipeline topologies) means no single serving configuration is optimal; workload characterization before deployment is essential for performance tuning
- RAGSchema-style abstractions provide a practical blueprint for building configurable RAG serving systems that can adapt optimization strategies to specific use-case requirements, which is especially valuable as RAG architectures continue to evolve rapidly
Abstract
Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. This paper makes three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.