RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Problem Statement
RAG has become a critical technique for reliable LLM deployment, but efficient serving remains unsolved due to the proliferation of diverse RAG variants with vastly different computational profiles. Existing LLM serving systems treat RAG as an extension rather than a first-class workload, leading to suboptimal resource utilization and performance. The lack of a unified abstraction makes it difficult to reason about and optimize across the broad landscape of RAG algorithms.
Key Novelty
- RAGSchema: A structured abstraction that formally captures the diverse space of RAG algorithms, enabling systematic performance analysis and optimization across variants
- Empirical workload characterization revealing significant performance variability across representative RAG patterns, exposing gaps in current serving infrastructure
- RAGO: A system optimization framework that leverages RAGSchema to apply targeted resource allocation and scheduling strategies tailored to specific RAG workload characteristics
Evaluation Highlights
- Up to 2× increase in queries per second (QPS) per chip compared to RAG systems built as extensions of standard LLM serving systems
- Up to 55% reduction in time-to-first-token (TTFT) latency over RAG baselines built as extensions of standard LLM serving systems
Methodology
- Define RAGSchema as a structured abstraction encoding key dimensions of RAG algorithms (e.g., retrieval frequency, granularity, fusion strategy, pipeline topology) to unify the diverse RAG design space
- Profile and characterize representative RAG workloads using RAGSchema, measuring compute/memory/latency bottlenecks to expose workload-specific performance variability
- Design RAGO optimizer that uses RAGSchema metadata to apply workload-aware system optimizations including resource partitioning, operator scheduling, and batching strategies for end-to-end RAG pipelines
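The RAGSchema abstraction described above can be sketched as a small data type. This is a minimal illustration of the idea, not the paper's actual API: the field names, enum values, and `is_iterative` helper are assumptions chosen to mirror the dimensions listed in the methodology (retrieval frequency, granularity, fusion strategy, pipeline topology).

```python
from dataclasses import dataclass
from enum import Enum

class FusionStrategy(Enum):
    """How retrieved context is combined with generation (illustrative values)."""
    PREPEND = "prepend"        # retrieved passages prepended to the prompt
    INTERLEAVE = "interleave"  # retrieval interleaved with decoding steps

@dataclass(frozen=True)
class RAGSchema:
    """Toy encoding of a RAG pipeline's key algorithmic dimensions."""
    retrieval_frequency: int    # retrievals per query (1 = one-shot)
    retrieval_granularity: int  # tokens per retrieved chunk
    passages_per_retrieval: int # top-k documents fetched per retrieval
    fusion: FusionStrategy      # fusion strategy for retrieved context
    stages: tuple               # ordered pipeline topology, e.g. ("retrieve", "prefill", "decode")

    def is_iterative(self) -> bool:
        # Iterative-retrieval pipelines retrieve more than once per query,
        # which shifts the workload balance toward the retrieval stage.
        return self.retrieval_frequency > 1
```

A one-shot QA pipeline and an iterative multi-hop pipeline would then differ only in their schema values, which is what lets an optimizer treat them as points in one configuration space rather than as separate systems.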
System Components
- RAGSchema abstraction layer: formally describes the algorithmic configuration of a RAG system, capturing retrieval patterns, generation coupling, and pipeline structure to serve as input to the optimizer
- Profiling and analysis module: benchmarks representative RAG workloads under RAGSchema, identifying performance bottlenecks and variability across retrieval and generation stages
- RAGO optimization framework: maps RAGSchema descriptions to hardware-aware serving strategies, optimizing resource allocation, request scheduling, and pipeline execution for diverse RAG workloads
- Execution backend: implements RAGO's optimization decisions, coordinating retrieval (vector search) and generation (LLM inference) components efficiently on target hardware
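To make the schema-to-strategy mapping concrete, here is a toy planner in the spirit of the optimization framework described above. The heuristics (chip split, prefill/decode placement, batch size) and all thresholds are invented for illustration and are not the paper's actual policies; it only shows the shape of a function from a RAGSchema-style description to a serving plan.

```python
def plan_serving(schema: dict, total_chips: int = 8) -> dict:
    """Map a RAGSchema-style description to a serving plan (toy heuristics)."""
    # Iterative-retrieval pipelines spend proportionally more time in vector
    # search, so shift a larger share of chips toward the retrieval stage.
    iterative = schema["retrieval_frequency"] > 1
    retrieval_share = 0.5 if iterative else 0.25
    retrieval_chips = max(1, round(total_chips * retrieval_share))
    generation_chips = total_chips - retrieval_chips

    # Collocate prefill and decode only when the generation side is tiny;
    # otherwise disaggregate so decode batching is not stalled behind
    # long-running prefill of freshly retrieved context.
    placement = "collocated" if generation_chips <= 2 else "disaggregated"

    # Frequent retrieval makes requests latency-sensitive between hops,
    # so cap the decode batch size lower than for one-shot pipelines.
    batch_size = 8 if iterative else 32

    return {
        "retrieval_chips": retrieval_chips,
        "generation_chips": generation_chips,
        "prefill_decode_placement": placement,
        "decode_batch_size": batch_size,
    }
```

The point of the sketch is that two workloads with different schemas (say, one-shot QA versus multi-hop retrieval) receive different resource splits and batching policies from the same planner, rather than a single static serving configuration.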
Results
| Metric | Baseline (LLM-extension RAG) | RAGO | Delta |
|---|---|---|---|
| QPS per chip (throughput) | 1× | Up to 2× | up to +100% |
| Time-to-first-token (TTFT) latency | 1× | 0.45× | up to −55% |
Key Takeaways
- Treating RAG as a first-class serving workload rather than an LLM extension yields substantial gains—ML engineers building production RAG systems should invest in RAG-specific serving infrastructure rather than adapting generic LLM servers
- The diversity of RAG variants (different retrieval frequencies, fusion strategies, pipeline topologies) means no single serving configuration is optimal; workload characterization before deployment is essential for performance tuning
- RAGSchema-style abstractions provide a practical blueprint for building configurable RAG serving systems that can adapt optimization strategies to specific use-case requirements, which is especially valuable as RAG architectures continue to evolve rapidly
Abstract
Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. This paper makes three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.