FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning
Problem Statement
Existing agentic workflow orchestration faces three obstacles: it requires heavy manual effort to construct and tune workflows; it is tightly coupled to specific operators or LLM backends, which limits generalizability; and it suffers from sparse reward signals that make automated learning difficult. Together these limitations prevent scalable, flexible deployment of agentic systems across diverse tasks and infrastructure choices.
Key Novelty
- End-to-end RL-based workflow orchestration where a lightweight policy model learns to edit and refine workflows through multi-turn interaction with an executable canvas environment
- Canvas Workflow Relative Policy Optimization (CWRPO), a novel training algorithm introducing diversity-constrained rewards with conditional release to stabilize learning and prevent shortcut behaviors
- A plug-and-play architecture supporting interchangeable LLM backends and diverse operator libraries, decoupling the orchestration policy from specific infrastructure dependencies
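The summary does not spell out CWRPO's exact formulation, but one plausible reading of "diversity-constrained rewards with conditional release" is a gate that withholds task reward until a rollout group contains enough distinct edit trajectories, layered on a group-relative baseline. The sketch below is an interpretation under that assumption; `cwrpo_rewards`, `min_distinct`, and the gating rule are hypothetical, not the paper's definitions.

```python
def cwrpo_rewards(task_rewards, trajectories, min_distinct=2):
    """Diversity-constrained reward with conditional release (sketch).

    Task reward is 'released' only when the rollout group contains at
    least `min_distinct` distinct edit sequences; otherwise every
    trajectory gets zero reward, discouraging collapse onto a single
    shortcut workflow.
    """
    distinct = len({tuple(t) for t in trajectories})
    released = distinct >= min_distinct  # conditional release gate
    return [r if released else 0.0 for r in task_rewards]

def group_relative_advantages(rewards):
    # Group-relative baseline (GRPO-style): advantage is the reward
    # minus the mean reward of the rollout group.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With identical trajectories the gate stays closed and all rewards are zeroed; once the group is diverse enough, the raw task rewards pass through and are centered against the group mean.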
Evaluation Highlights
- FlowSteer significantly outperforms baselines across twelve diverse datasets spanning various task categories
- The CWRPO training objective demonstrably addresses sparse reward and shortcut behavior issues compared to standard RL training approaches
Methodology
- A lightweight policy model observes the current execution state of an agentic workflow canvas and selects editing actions (e.g., adding, removing, or modifying operators) to iteratively construct or refine the workflow
- The executable canvas environment runs the operators specified by the policy, collects execution results, and returns feedback signals to the policy model for the next decision step
- The policy model is trained using CWRPO, which shapes reward signals with diversity constraints and conditional release mechanisms to ensure stable learning, prevent reward hacking, and encourage exploration of valid workflow configurations
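The observe-edit-execute cycle described above can be sketched as a simple loop. Everything here is a toy stand-in, not the paper's implementation: `canvas_execute` fakes the executable canvas with a trivial score, and `policy_select_action` draws random edits where a trained policy model would act.

```python
import random

def canvas_execute(workflow):
    # Stand-in for the executable canvas: "run" the operator sequence
    # and return feedback (here, a toy score plus the execution trace).
    return {"score": len(set(workflow)) / 10.0, "trace": list(workflow)}

def policy_select_action(state, feedback):
    # Stand-in for the lightweight policy model: a real policy would
    # condition on the canvas state and the previous feedback.
    actions = ["add:retriever", "add:solver", "remove:last", "modify:verifier"]
    return random.choice(actions)

def apply_action(workflow, action):
    # Editing actions: add, remove, or modify operators on the canvas.
    kind, _, op = action.partition(":")
    if kind == "add":
        workflow.append(op)
    elif kind == "remove" and workflow:
        workflow.pop()
    elif kind == "modify" and workflow:
        workflow[-1] = op
    return workflow

def orchestrate(max_turns=5):
    workflow, feedback = [], None
    for _ in range(max_turns):
        action = policy_select_action(workflow, feedback)
        workflow = apply_action(workflow, action)
        feedback = canvas_execute(workflow)  # feedback drives the next edit
    return workflow, feedback
```

Each turn the policy proposes one edit, the canvas executes the resulting workflow, and the returned feedback informs the next decision, which is the multi-turn interaction loop the methodology describes.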
System Components
- Policy model: a small, trainable model that serves as the agent, analyzing the current workflow execution state and selecting editing actions to iteratively build or modify the workflow
- Canvas environment: a sandboxed execution environment that runs operator sequences defined by the policy, returning execution feedback and intermediate results to enable iterative refinement
- CWRPO: a custom RL training algorithm that introduces diversity-constrained rewards with conditional release to stabilize training, address sparse rewards, and prevent the policy from exploiting shortcut behaviors
- Operator registry: a modular registry that allows diverse tool/function operators and interchangeable LLM backends to be swapped in without retraining the core policy
- Interaction loop: the iterative cycle of policy action selection, canvas execution, and feedback integration that enables progressive workflow construction and refinement
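One common way to get the plug-and-play decoupling the operator registry describes is name-based dispatch: the policy emits operator names, and implementations behind those names can be swapped freely. The class and operator names below are illustrative assumptions, not FlowSteer's API.

```python
from typing import Callable, Dict

class OperatorRegistry:
    """Hypothetical plug-and-play registry: operators and LLM backends
    are registered under string names, so the orchestration policy only
    emits names and never depends on a concrete implementation."""

    def __init__(self) -> None:
        self._ops: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._ops[name] = fn

    def run(self, name: str, payload: str) -> str:
        if name not in self._ops:
            raise KeyError(f"unknown operator: {name}")
        return self._ops[name](payload)

registry = OperatorRegistry()
registry.register("uppercase", str.upper)               # toy tool operator
registry.register("echo_llm", lambda p: f"[llm] {p}")   # stand-in LLM backend
```

Swapping the `echo_llm` entry for a different backend changes nothing for the policy, which is the sense in which components can be upgraded without retraining.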
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Average performance across 12 datasets | Existing workflow orchestration methods | Significant outperformance | Substantially higher |
| Shortcut behavior suppression | Standard RL training | CWRPO with diversity constraints | Improved stability |
| Operator/LLM flexibility | Task-specific fixed pipelines | Plug-and-play modular framework | Generalized across backends |
Key Takeaways
- RL-based orchestration with a lightweight policy model can replace expensive manual workflow design, lowering the barrier to deploying complex agentic pipelines at scale
- CWRPO's diversity-constrained reward shaping is a practical technique worth adopting when training agents in environments with sparse or gameable reward signals
- Decoupling the orchestration policy from specific LLMs and operators via a plug-and-play design is critical for real-world deployability, allowing practitioners to upgrade components independently without retraining
Abstract
In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that pairs a lightweight policy model, acting as the agent, with an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.