
Batch Query Processing and Optimization for Agentic Workflows

Junyi Shen, Noppanat Wadlom, Yao Lu
arXiv.org | 2025
Halo is a system that applies database-style batch query optimization to agentic LLM workflows by representing multi-agent pipelines as DAG query plans and exploiting shared computation across batched queries. It jointly optimizes prefill/decode costs, KV-cache reuse, and CPU-GPU pipelining to dramatically reduce redundancy in batch analytics scenarios.

Problem Statement

Existing LLM serving engines optimize individual calls in isolation, ignoring the cross-call redundancies inherent in multi-agent workflows such as repeated prompts and overlapping contexts. Multi-agent orchestration frameworks focus on logical coordination but lack system-level performance planning, leaving hardware severely underutilized. This mismatch is especially costly in batch analytics workloads where thousands of queries share significant structural overlap.

Key Novelty

  • Structured DAG-based query plan representation for agentic workflows that enables a consolidated computation graph across batched queries, exposing and exploiting shared sub-computations (see the sketch after this list)
  • A joint cost model that simultaneously accounts for heterogeneous resource constraints, prefill vs. decode costs, KV-cache reuse opportunities, and GPU placement decisions to guide plan-level optimization
  • A Processor component combining adaptive batching, KV-cache sharing and migration, and fine-grained CPU-GPU pipelining to maximize holistic hardware efficiency across the full agentic execution stack
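
To make the first point concrete, here is a minimal sketch of what a query-plan node and batch consolidation could look like. The `PlanNode` fields and the `consolidate` function are illustrative assumptions, not Halo's actual API; the paper's graphs also capture tool invocations and data dependencies.

```python
# Minimal sketch of a DAG query plan and multi-query consolidation.
# All names here are assumptions for illustration, not Halo's API.
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class PlanNode:
    node_id: str
    kind: str                                       # "llm_call" or "tool_call"
    prompt_prefix: str                              # shared system prompt / context
    payload: str                                    # per-query suffix (question, data chunk)
    deps: list[str] = field(default_factory=list)   # upstream node_ids

def consolidate(per_query_dags):
    """Merge per-query DAGs into one multi-query graph by bucketing LLM
    calls on their shared prompt prefix, so each distinct prefix is
    prefilled (and KV-cached) once for the whole batch."""
    buckets = defaultdict(list)
    for dag in per_query_dags:
        for node in dag:
            if node.kind == "llm_call":
                buckets[node.prompt_prefix].append(node)
    return buckets
```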

Evaluation Highlights

  • Up to 3.6x speedup for batch inference across six benchmarks, scaling to workloads of thousands of queries and complex agent graphs without compromising output quality
  • Up to 2.6x throughput improvement under online serving conditions, demonstrating benefits in both offline batch and real-time deployment scenarios

Breakthrough Assessment

7/10. Halo introduces a genuinely novel cross-layer abstraction that brings database query optimization to LLM serving for agentic systems, a combination that prior work has largely left unaddressed. The 3.6x batch speedup and the system-level design are significant advances for production agentic workloads, though the core techniques (DAG scheduling, KV caching, batching) are individually well-established.

Methodology

  1. Represent each agentic workflow as a structured query plan DAG capturing agent dependencies, tool calls, and prompt structures; then merge DAGs across a batch of queries into a consolidated graph that makes shared prompt prefixes and reusable context segments explicit (a toy calculation of this payoff follows the list)
  2. Apply a cost-model-guided plan optimizer that scores candidate execution plans by jointly estimating prefill and decode compute costs, KV-cache reuse savings, memory constraints, and optimal GPU placement, selecting the plan that minimizes total redundant execution
  3. Execute the optimized plan via the Processor, which performs adaptive batching of LLM calls, migrates and shares KV-cache blocks across agents, and pipelines CPU-side tasks (tool execution, orchestration logic) with GPU-side inference to eliminate idle hardware time
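
A toy calculation of step 1's payoff, with assumed numbers: once the batch is consolidated, a prompt prefix shared by every query is prefilled once rather than once per query.

```python
# Toy numbers, for illustration only: 1,000 queries sharing a
# 2,000-token prompt prefix, each with a 100-token suffix.
n_queries, prefix_tokens, suffix_tokens = 1_000, 2_000, 100

isolated = n_queries * (prefix_tokens + suffix_tokens)      # 2,100,000 tokens prefilled
consolidated = prefix_tokens + n_queries * suffix_tokens    #   102,000 tokens prefilled
print(f"prefill tokens saved: {isolated - consolidated:,}") # 1,998,000
```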

System Components

Query Plan DAG Builder

Translates each agentic workflow into a directed acyclic graph of LLM calls, tool invocations, and data dependencies, then constructs a consolidated multi-query graph to surface shared computation opportunities

Cost Model

Jointly estimates execution cost across heterogeneous resources by modeling prefill latency, decode latency, KV-cache hit rates, GPU memory capacity, and placement decisions to guide optimization
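
Here is a minimal sketch of such a joint estimate. The per-token constants are invented for illustration, and the function omits the memory-capacity and placement terms that the paper's model also covers.

```python
# Illustrative per-token latency constants (assumed, not measured).
PREFILL_MS_PER_TOKEN = 0.25   # prefill is compute-bound and parallel
DECODE_MS_PER_TOKEN = 8.0     # decode is sequential and memory-bound

def estimate_cost(prompt_tokens, cached_tokens, output_tokens):
    """Estimated latency (ms) of one LLM call: prompt tokens already in
    the KV cache skip prefill; every output token pays decode cost."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached * PREFILL_MS_PER_TOKEN + output_tokens * DECODE_MS_PER_TOKEN
```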

Plan-Level Optimizer

Searches over execution plan variants guided by the cost model to minimize redundant LLM calls and maximize cache reuse across the batched workflow graph
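
A minimal sketch of that search, assuming plan variants are already enumerated and scored with a cost function like `estimate_cost` above; Halo's actual optimizer explores a richer space that includes GPU placement and cache migration.

```python
def choose_plan(candidate_plans, cost_fn):
    """Pick the cheapest plan variant. Each candidate is a list of
    (prompt_tokens, cached_tokens, output_tokens) tuples, one per LLM
    call in the consolidated graph; cache decisions differ per variant."""
    def total_cost(plan):
        return sum(cost_fn(p, c, o) for p, c, o in plan)
    return min(candidate_plans, key=total_cost)
```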

Processor

Runtime execution engine that implements adaptive batching of LLM requests, KV-cache block sharing and migration between agents, and fine-grained CPU-GPU pipelining to maximize end-to-end hardware utilization
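
Below is a minimal sketch of the pipelining idea using a thread pool: CPU-side tool execution for the next batch overlaps with GPU-side inference for the current one. `run_tools` and `run_inference` are hypothetical stand-ins for the orchestration and serving layers, not Halo's interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_execute(batches, run_tools, run_inference):
    """Overlap CPU-side tool work for batch i+1 with GPU inference for
    batch i, so neither side waits on the other."""
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = cpu.submit(run_tools, batches[0])   # prepare the first batch
        for nxt in batches[1:] + [None]:
            ready = pending.result()                  # wait for CPU phase
            if nxt is not None:
                pending = cpu.submit(run_tools, nxt)  # start next CPU phase
            results.append(run_inference(ready))      # GPU phase overlaps next CPU phase
    return results
```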

Results

| Metric/Benchmark | Baseline (isolated serving) | Halo | Delta |
| --- | --- | --- | --- |
| Batch inference speedup (best case) | 1.0x | 3.6x | +260% |
| Online serving throughput (best case) | 1.0x | 2.6x | +160% |
| Output quality | Baseline accuracy | Matching accuracy | No degradation |
| Scalability | Degrades at scale | Handles thousands of queries and complex graphs | Qualitative improvement |

Key Takeaways

  • Teams building batch agentic pipelines (e.g., document analysis, multi-step data analytics) should consider query-plan-level optimization rather than optimizing individual LLM calls, as cross-call redundancy is a dominant cost factor at scale
  • KV-cache sharing across agents within a workflow is a high-leverage optimization: if multiple agents consume the same system prompt or document context, sharing the KV cache rather than recomputing it can yield substantial speedups without any accuracy trade-off
  • CPU-GPU pipelining between tool execution and LLM inference is an underexplored efficiency lever in agentic systems: overlapping these phases can significantly improve hardware utilization in tool-heavy workflows where the GPU would otherwise sit idle

Abstract

Large Language Models (LLMs) in agentic workflows combine multi-step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while multi-agent frameworks focus on orchestration without system-level performance planning. As a result, repeated prompts, overlapping contexts, and fragmented CPU-GPU execution create substantial redundancy and poor hardware utilization, especially in batch analytics scenarios. We introduce Halo, a system that brings batch query processing and optimization into agentic LLM workflows. Halo represents each workflow as a structured query plan DAG and constructs a consolidated graph for batched queries that exposes shared computation. Guided by a cost model that jointly considers heterogeneous resource constraints, prefill and decode costs, cache reuse, and GPU placement, Halo performs plan-level optimization to minimize redundant execution. The Processor integrates adaptive batching, KV-cache sharing and migration, along with fine-grained CPU-GPU pipelining to maximize holistic hardware efficiency. Evaluation across six benchmarks shows that Halo achieves up to 3.6x speedup for batch inference and 2.6x throughput improvement under online serving, scaling to workloads of thousands of queries and complex graphs. These gains are achieved without compromising output quality. By unifying query optimization with heterogeneous LLM serving, Halo enables efficient agentic workflows in data analytics and decision-making applications.
