FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning
Problem Statement
Existing agentic workflow orchestration faces three obstacles: it requires heavy manual effort to construct and tune workflows; it is tightly coupled to specific operators or LLM backends, which limits generalizability; and it suffers from sparse reward signals that make automated learning difficult. Together these limitations prevent scalable, flexible deployment of agentic systems across diverse tasks and infrastructure choices.
Key Novelty
- End-to-end RL-based workflow orchestration where a lightweight policy model learns to edit and refine workflows through multi-turn interaction with an executable canvas environment
- Canvas Workflow Relative Policy Optimization (CWRPO), a novel training algorithm introducing diversity-constrained rewards with conditional release to stabilize learning and prevent shortcut behaviors
- A plug-and-play architecture supporting interchangeable LLM backends and diverse operator libraries, decoupling the orchestration policy from specific infrastructure dependencies
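The summary does not spell out CWRPO's exact formulation, but one plausible reading of "diversity-constrained rewards with conditional release" is a gate that withholds task reward until a rollout group contains enough distinct edit trajectories, layered on a group-relative baseline. The sketch below is an interpretation under that assumption; `cwrpo_rewards`, `min_distinct`, and the gating rule are hypothetical, not the paper's definitions.

```python
def cwrpo_rewards(task_rewards, trajectories, min_distinct=2):
    """Diversity-constrained reward with conditional release (sketch).

    Task reward is 'released' only when the rollout group contains at
    least `min_distinct` distinct edit sequences; otherwise every
    trajectory gets zero reward, discouraging collapse onto a single
    shortcut workflow.
    """
    distinct = len({tuple(t) for t in trajectories})
    released = distinct >= min_distinct  # conditional release gate
    return [r if released else 0.0 for r in task_rewards]

def group_relative_advantages(rewards):
    # Group-relative baseline (GRPO-style): advantage is the reward
    # minus the mean reward of the rollout group.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With identical trajectories the gate stays closed and all rewards are zeroed; once the group is diverse enough, the raw task rewards pass through and are centered against the group mean.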
Evaluation Highlights
- FlowSteer significantly outperforms baselines across twelve diverse datasets spanning various task categories
- The CWRPO training objective demonstrably addresses sparse reward and shortcut behavior issues compared to standard RL training approaches
Methodology
- A lightweight policy model observes the current execution state of an agentic workflow canvas and selects editing actions (e.g., adding, removing, or modifying operators) to iteratively construct or refine the workflow
- The executable canvas environment runs the operators specified by the policy, collects execution results, and returns feedback signals to the policy model for the next decision step
- The policy model is trained using CWRPO, which shapes reward signals with diversity constraints and conditional release mechanisms to ensure stable learning, prevent reward hacking, and encourage exploration of valid workflow configurations
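The observe-edit-execute cycle described above can be sketched as a simple loop. Everything here is a toy stand-in, not the paper's implementation: `canvas_execute` fakes the executable canvas with a trivial score, and `policy_select_action` draws random edits where a trained policy model would act.

```python
import random

def canvas_execute(workflow):
    # Stand-in for the executable canvas: "run" the operator sequence
    # and return feedback (here, a toy score plus the execution trace).
    return {"score": len(set(workflow)) / 10.0, "trace": list(workflow)}

def policy_select_action(state, feedback):
    # Stand-in for the lightweight policy model: a real policy would
    # condition on the canvas state and the previous feedback.
    actions = ["add:retriever", "add:solver", "remove:last", "modify:verifier"]
    return random.choice(actions)

def apply_action(workflow, action):
    # Editing actions: add, remove, or modify operators on the canvas.
    kind, _, op = action.partition(":")
    if kind == "add":
        workflow.append(op)
    elif kind == "remove" and workflow:
        workflow.pop()
    elif kind == "modify" and workflow:
        workflow[-1] = op
    return workflow

def orchestrate(max_turns=5):
    workflow, feedback = [], None
    for _ in range(max_turns):
        action = policy_select_action(workflow, feedback)
        workflow = apply_action(workflow, action)
        feedback = canvas_execute(workflow)  # feedback drives the next edit
    return workflow, feedback
```

Each turn the policy proposes one edit, the canvas executes the resulting workflow, and the returned feedback informs the next decision, which is the multi-turn interaction loop the methodology describes.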
System Components
- Policy model: a small, trainable model that serves as the agent, analyzing the current workflow execution state and selecting editing actions to iteratively build or modify the workflow
- Canvas environment: a sandboxed execution environment that runs operator sequences defined by the policy, returning execution feedback and intermediate results to enable iterative refinement
- CWRPO: a custom RL training algorithm that introduces diversity-constrained rewards with conditional release to stabilize training, address sparse rewards, and prevent the policy from exploiting shortcut behaviors
- Operator registry: a modular registry that allows diverse tool/function operators and interchangeable LLM backends to be swapped in without retraining the core policy
- Interaction loop: the iterative cycle of policy action selection, canvas execution, and feedback integration that enables progressive workflow construction and refinement
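One common way to get the plug-and-play decoupling the operator registry describes is name-based dispatch: the policy emits operator names, and implementations behind those names can be swapped freely. The class and operator names below are illustrative assumptions, not FlowSteer's API.

```python
from typing import Callable, Dict

class OperatorRegistry:
    """Hypothetical plug-and-play registry: operators and LLM backends
    are registered under string names, so the orchestration policy only
    emits names and never depends on a concrete implementation."""

    def __init__(self) -> None:
        self._ops: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._ops[name] = fn

    def run(self, name: str, payload: str) -> str:
        if name not in self._ops:
            raise KeyError(f"unknown operator: {name}")
        return self._ops[name](payload)

registry = OperatorRegistry()
registry.register("uppercase", str.upper)               # toy tool operator
registry.register("echo_llm", lambda p: f"[llm] {p}")   # stand-in LLM backend
```

Swapping the `echo_llm` entry for a different backend changes nothing for the policy, which is the sense in which components can be upgraded without retraining.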
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Average performance across 12 datasets | Existing workflow orchestration methods | Significant outperformance | Substantially higher |
| Shortcut behavior suppression | Standard RL training | CWRPO with diversity constraints | Improved stability |
| Operator/LLM flexibility | Task-specific fixed pipelines | Plug-and-play modular framework | Generalized across backends |
Key Takeaways
- RL-based orchestration with a lightweight policy model can replace expensive manual workflow design, lowering the barrier to deploying complex agentic pipelines at scale
- CWRPO's diversity-constrained reward shaping is a practical technique worth adopting when training agents in environments with sparse or gameable reward signals
- Decoupling the orchestration policy from specific LLMs and operators via a plug-and-play design is critical for real-world deployability, allowing practitioners to upgrade components independently without retraining
Abstract
In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that pairs a lightweight policy model, acting as the agent, with an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.