Batch Query Processing and Optimization for Agentic Workflows
Problem Statement
Existing LLM serving engines optimize individual calls in isolation, ignoring the cross-call redundancies inherent in multi-agent workflows such as repeated prompts and overlapping contexts. Multi-agent orchestration frameworks focus on logical coordination but lack system-level performance planning, leaving hardware severely underutilized. This mismatch is especially costly in batch analytics workloads where thousands of queries share significant structural overlap.
Key Novelty
- Structured DAG-based query plan representation for agentic workflows that enables a consolidated computation graph across batched queries to expose and exploit shared sub-computations
- A joint cost model that simultaneously accounts for heterogeneous resource constraints, prefill vs. decode costs, KV-cache reuse opportunities, and GPU placement decisions to guide plan-level optimization
- A Processor component combining adaptive batching, KV-cache sharing and migration, and fine-grained CPU-GPU pipelining to maximize holistic hardware efficiency across the full agentic execution stack
Evaluation Highlights
- Up to 3.6x speedup for batch inference across six benchmarks, scaling to workloads of thousands of queries and complex agent graphs without compromising output quality
- Up to 2.6x throughput improvement under online serving conditions, demonstrating benefits in both offline batch and real-time deployment scenarios
Methodology
- Represent each agentic workflow as a structured query plan DAG capturing agent dependencies, tool calls, and prompt structures; then merge DAGs across a batch of queries into a consolidated graph that makes shared prompt prefixes and reusable context segments explicit
- Apply a cost-model-guided plan optimizer that scores candidate execution plans by jointly estimating prefill and decode compute costs, KV-cache reuse savings, memory constraints, and optimal GPU placement, selecting the plan that minimizes total redundant execution
- Execute the optimized plan via the Processor, which performs adaptive batching of LLM calls, migrates and shares KV-cache blocks across agents, and pipelines CPU-side tasks (tool execution, orchestration logic) with GPU-side inference to eliminate idle hardware time
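The first step above, consolidating per-query plan DAGs so shared prefixes appear only once, can be illustrated with a minimal trie-style sketch. All names here (`PlanNode`, `consolidate`, `shared_nodes`) are hypothetical and simplify prompts to lists of opaque segments; the source does not specify Halo's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    """One prompt segment in the consolidated multi-query graph."""
    segment: str
    children: dict = field(default_factory=dict)
    ref_count: int = 0  # how many queries pass through this node

def consolidate(queries):
    """Merge per-query segment sequences into one trie-like graph.

    Shared prefixes collapse into a single node, making explicit the
    sub-computations (e.g. a common system prompt or document context)
    that need to be prefilled only once for the whole batch.
    """
    root = PlanNode(segment="<root>")
    for segments in queries:
        node = root
        for seg in segments:
            node = node.children.setdefault(seg, PlanNode(segment=seg))
            node.ref_count += 1
    return root

def shared_nodes(node):
    """Count nodes traversed by more than one query (reuse opportunities)."""
    n = 1 if node.ref_count > 1 else 0
    return n + sum(shared_nodes(c) for c in node.children.values())

# Three analytics queries over two documents, sharing a system prompt:
queries = [
    ["SYS_PROMPT", "DOC_A", "Q: summarize"],
    ["SYS_PROMPT", "DOC_A", "Q: extract entities"],
    ["SYS_PROMPT", "DOC_B", "Q: summarize"],
]
root = consolidate(queries)
```

In this toy batch, `SYS_PROMPT` and `DOC_A` become shared nodes, so their prefill work is exposed as reusable rather than repeated per query.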
System Components
- Query plan builder: translates each agentic workflow into a directed acyclic graph of LLM calls, tool invocations, and data dependencies, then constructs a consolidated multi-query graph to surface shared computation opportunities
- Cost model: jointly estimates execution cost across heterogeneous resources by modeling prefill latency, decode latency, KV-cache hit rates, GPU memory capacity, and placement decisions to guide optimization
- Plan optimizer: searches over execution plan variants guided by the cost model to minimize redundant LLM calls and maximize cache reuse across the batched workflow graph
- Processor: runtime execution engine that implements adaptive batching of LLM requests, KV-cache block sharing and migration between agents, and fine-grained CPU-GPU pipelining to maximize end-to-end hardware utilization
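The cost model's core intuition, that prefill is paid only for tokens whose KV cache is not already resident while decode is paid per output token, can be sketched as a simple latency estimate. The per-token constants and function names below are illustrative placeholders, not Halo's measured model.

```python
def call_cost(prompt_tokens, output_tokens, cached_tokens,
              prefill_ms_per_tok=0.05, decode_ms_per_tok=2.0):
    """Estimated latency of one LLM call, in milliseconds.

    Prefill cost applies only to prompt tokens not covered by a
    resident KV-cache prefix; decode is sequential, one step per
    output token. Constants are illustrative, not measurements.
    """
    new_tokens = max(prompt_tokens - cached_tokens, 0)
    return prefill_ms_per_tok * new_tokens + decode_ms_per_tok * output_tokens

def plan_cost(calls):
    """Total cost of a plan: a list of (prompt, output, cached) triples."""
    return sum(call_cost(p, o, c) for p, o, c in calls)

# Two calls sharing a 2000-token prefix (system prompt + document).
# An isolated-serving plan recomputes the prefix; a cache-aware plan
# reuses the KV blocks left by the first call:
isolated = [(2100, 50, 0), (2100, 50, 0)]
shared   = [(2100, 50, 0), (2100, 50, 2000)]
```

A plan optimizer in this spirit would score candidate orderings and placements with such a model and pick the one minimizing total cost; here the cache-aware plan scores strictly lower because the second call prefills only 100 new tokens.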
Results
| Metric/Benchmark | Baseline (isolated serving) | Halo | Delta |
|---|---|---|---|
| Batch inference speedup (best case) | 1.0x | 3.6x | +260% |
| Online serving throughput (best case) | 1.0x | 2.6x | +160% |
| Output quality | Baseline accuracy | Matching accuracy | No degradation |
| Scalability | Degrades at scale | Handles thousands of queries + complex graphs | Qualitative improvement |
Key Takeaways
- Teams building batch agentic pipelines (e.g., document analysis, multi-step data analytics) should consider query-plan-level optimization rather than optimizing individual LLM calls, as cross-call redundancy is a dominant cost factor at scale
- KV-cache sharing across agents within a workflow is a high-leverage optimization: if multiple agents consume the same system prompt or document context, sharing the KV cache rather than recomputing it can yield substantial speedups without any accuracy trade-off
- CPU-GPU pipelining between tool execution and LLM inference is an underexplored efficiency lever in agentic systems: overlapping these phases can significantly improve hardware utilization in tool-heavy workflows where the GPU would otherwise sit idle
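The pipelining idea in the last takeaway can be sketched with a two-stage overlap: while the "GPU" runs inference for step i, a worker thread executes the tool call for step i+1. The function names are hypothetical and the sleeps stand in for real CPU and GPU work; this is a scheduling sketch, not Halo's Processor.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_tool(task):
    """CPU-side work: tool call, parsing, orchestration logic."""
    time.sleep(0.05)  # stand-in for real tool latency
    return f"tool:{task}"

def run_inference(batch):
    """GPU-side work: stand-in for an LLM forward pass."""
    time.sleep(0.05)  # stand-in for prefill + decode
    return [f"gen:{x}" for x in batch]

def pipelined(tasks):
    """Overlap the tool call for step i+1 with inference for step i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = cpu.submit(run_tool, tasks[0])
        for nxt in tasks[1:] + [None]:
            tool_out = pending.result()
            if nxt is not None:
                pending = cpu.submit(run_tool, nxt)   # CPU works ahead
            results.extend(run_inference([tool_out]))  # GPU busy meanwhile
    return results
```

With n steps, the serial schedule costs roughly n x (tool + inference), while the overlapped schedule approaches one tool call plus n inference steps, which is where the utilization gain in tool-heavy workflows comes from.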
Abstract
Large Language Models (LLMs) in agentic workflows combine multi-step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while multi-agent frameworks focus on orchestration without system-level performance planning. As a result, repeated prompts, overlapping contexts, and fragmented CPU-GPU execution create substantial redundancy and poor hardware utilization, especially in batch analytics scenarios. We introduce Halo, a system that brings batch query processing and optimization into agentic LLM workflows. Halo represents each workflow as a structured query plan DAG and constructs a consolidated graph for batched queries that exposes shared computation. Guided by a cost model that jointly considers heterogeneous resource constraints, prefill and decode costs, cache reuse, and GPU placement, Halo performs plan-level optimization to minimize redundant execution. The Processor integrates adaptive batching, KV-cache sharing and migration, and fine-grained CPU-GPU pipelining to maximize holistic hardware efficiency. Evaluation across six benchmarks shows that Halo achieves up to 3.6x speedup for batch inference and 2.6x throughput improvement under online serving, scaling to workloads of thousands of queries and complex graphs. These gains are achieved without compromising output quality. By unifying query optimization with heterogeneous LLM serving, Halo enables efficient agentic workflows in data analytics and decision-making applications.