GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer
Problem Statement
Automatic prompt optimization is critical for LLM performance but faces a combinatorially large search space with sparse rewards due to costly target-LM evaluations. Existing RL-based prompt optimizers suffer from poor sample efficiency because they rely on on-policy updates and sample from a fixed meta-prompt distribution. This limits exploration diversity and wastes expensive evaluations by not reusing past prompt-reward data.
Key Novelty
- Casting prompt optimization as posterior inference over latent prompts regularized by a meta-prompted reference-LM prior, enabling principled probabilistic search
- Off-policy GFlowNet training objective with a replay buffer that reuses past prompt evaluations, dramatically improving sample efficiency over on-policy RL baselines
- Dynamic Memory Update (DMU): a training-free mechanism that progressively refines the meta-prompt by injecting diverse replay buffer prompts and top-performing prompts from a priority queue
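The posterior-inference view can be made concrete with a schematic objective. The exponential-reward form and temperature below are the standard shape of a reward-regularized posterior and are an assumption about the paper's exact notation:

```latex
p^*(z \mid m) \;\propto\; p_{\mathrm{ref}}(z \mid m)\,\exp\!\big(R(z)/\tau\big)
```

where \(z\) is a candidate prompt, \(m\) the meta-prompt, \(p_{\mathrm{ref}}\) the meta-prompted reference-LM prior, \(R(z)\) the target-LM reward, and \(\tau\) a temperature. The GFlowNet-trained prompt-LM then acts as an amortized sampler of this unnormalized density.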
Evaluation Highlights
- GFlowPO consistently outperforms recent discrete prompt optimization baselines across few-shot text classification, instruction induction benchmarks, and question answering tasks
- The framework demonstrates improved sample efficiency by reusing past evaluations via replay, reducing the number of expensive target-LM calls needed to find high-reward prompts
Methodology
- Step 1 - Problem Formulation: Frame prompt search as posterior inference over latent prompts, where a lightweight prompt-LM is fine-tuned to generate prompts regularized by a meta-prompted reference-LM prior
- Step 2 - Off-Policy GFlowNet Training: Fine-tune the prompt-LM using a GFlowNet objective with a replay buffer that stores past (prompt, reward) pairs, enabling off-policy updates that reuse expensive evaluations for sample-efficient exploration
- Step 3 - Dynamic Memory Update (DMU): At inference/search time, iteratively update the meta-prompt by injecting diverse prompts from the replay buffer and top-performing prompts from a priority queue to progressively focus search on high-reward regions
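Under toy assumptions, the three steps above can be sketched as follows. Everything here is hypothetical: a categorical "prompt-LM" over five fixed candidate strings, a made-up reward table in place of the target LM, and a simple reward-weighted reweighting standing in for the actual GFlowNet loss.

```python
import heapq
import random

random.seed(0)

# Hypothetical toy search space; real prompts are free-form text.
CANDIDATES = [
    "Answer concisely.",
    "Think step by step.",
    "List key facts first.",
    "Explain like I'm five.",
    "Quote the source text.",
]

def evaluate(prompt):
    """Stand-in for an expensive target-LM evaluation (made-up scores)."""
    return {"Think step by step.": 0.9, "List key facts first.": 0.7}.get(prompt, 0.3)

def gflowpo_search(steps=20, top_k=2):
    replay = {}                              # prompt -> reward (off-policy reuse)
    pqueue = []                              # min-heap of (reward, prompt)
    weights = {p: 1.0 for p in CANDIDATES}   # toy categorical "prompt-LM"
    meta_prompt = "Write an instruction for the task."
    for _ in range(steps):
        # Step 2 (sampling side): draw a prompt from the current policy.
        total = sum(weights.values())
        prompt = random.choices(CANDIDATES,
                                [weights[p] / total for p in CANDIDATES])[0]
        # Reuse the cached reward instead of re-calling the target LM.
        r = replay.get(prompt, evaluate(prompt))
        replay[prompt] = r
        heapq.heappush(pqueue, (r, prompt))
        if len(pqueue) > top_k:
            heapq.heappop(pqueue)            # keep only the top_k prompts
        # Step 2 (update side): off-policy update from replayed pairs; this
        # reward-weighted reweighting stands in for the GFlowNet objective.
        for p, pr in random.sample(list(replay.items()), min(2, len(replay))):
            weights[p] *= 1.0 + pr
        # Step 3 (DMU): rebuild the meta-prompt from top + diverse prompts.
        tops = [p for _, p in sorted(pqueue, reverse=True)]
        diverse = random.choice(list(replay))
        meta_prompt = ("Write an instruction for the task. Good examples: "
                       + "; ".join(tops + [diverse]))
    return meta_prompt, pqueue, replay

meta, pq, buf = gflowpo_search()
```

Note how each prompt's reward is cached in the replay buffer, so repeated samples of the same prompt never trigger a second "target-LM" call.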
System Components
- Prompt-LM: a lightweight language model fine-tuned with GFlowNet objectives to generate candidate prompts as a learned posterior distribution
- GFlowNet training objective: an off-policy generative flow network loss that trains the prompt-LM to generate prompts with probability proportional to their reward, enabling diverse yet high-quality exploration
- Replay buffer: stores previously evaluated (prompt, reward) pairs for off-policy reuse, improving sample efficiency by avoiding redundant target-LM evaluations
- Dynamic Memory Update (DMU): a training-free meta-prompt update mechanism that injects diverse and top-performing prompts into the meta-prompt context, progressively steering generation toward high-reward regions
- Priority queue: maintains the top-performing prompts found so far for inclusion in the DMU, ensuring the meta-prompt always reflects the best-known solutions
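The exact GFlowNet loss that ties these components together is not spelled out here. Assuming the common trajectory-balance form, it reduces for autoregressive prompt generation (where the backward policy is deterministic, so the log P_B term drops out) to:

```python
import math

def trajectory_balance_loss(log_z, log_p_forward, reward):
    """Trajectory balance: (log Z + log P_F(prompt) - log R(prompt))^2.
    log_z is a learned scalar estimate of the partition function; log_p_forward
    is the prompt-LM's log-probability of the sampled prompt."""
    return (log_z + log_p_forward - math.log(reward)) ** 2

# At the optimum, P_F(prompt) = R(prompt) / Z, so the loss vanishes:
loss = trajectory_balance_loss(log_z=math.log(2.0),
                               log_p_forward=math.log(0.4),  # P_F = 0.8 / 2
                               reward=0.8)
```

Because this loss is well-defined for any (prompt, reward) pair regardless of which policy produced it, minibatches can mix fresh samples with pairs drawn from the replay buffer, which is what makes the off-policy reuse sound.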
Results
| Benchmark | Best Baseline | GFlowPO | Delta |
|---|---|---|---|
| Few-shot Text Classification | Competitive SOTA baseline | Consistently higher accuracy | Positive improvement |
| Instruction Induction | Competitive SOTA baseline | Consistently higher accuracy | Positive improvement |
| Question Answering | Competitive SOTA baseline | Consistently higher accuracy | Positive improvement |
| Sample Efficiency | On-policy RL (no replay) | Improved via replay buffer reuse | Fewer target-LM calls needed |
Key Takeaways
- Practitioners can adopt GFlowPO to reduce the number of expensive target-LM evaluations needed for prompt optimization by leveraging replay buffers with off-policy GFlowNet training
- The Dynamic Memory Update mechanism is training-free and can potentially be integrated as a plug-in enhancement to other prompt optimization systems without retraining
- Framing prompt search as posterior inference (rather than pure reward maximization) encourages diversity in generated prompts, which is crucial for escaping local optima in the large discrete prompt space
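As a small illustration of the diversity point (with made-up rewards), sampling prompts in proportion to exp(R/τ) keeps probability mass on suboptimal prompts that greedy reward maximization would discard entirely:

```python
import math

def reward_proportional(rewards, tau=1.0):
    """Sampling distribution p(z) proportional to exp(R(z)/tau): high-reward
    prompts dominate, but lower-reward ones retain nonzero mass."""
    w = [math.exp(r / tau) for r in rewards]
    s = sum(w)
    return [x / s for x in w]

rewards = [0.9, 0.7, 0.3]          # hypothetical prompt rewards
probs = reward_proportional(rewards)
# Greedy maximization would put all mass on the first prompt; here over 20%
# of the mass stays on the lowest-reward prompt, preserving exploration.
```

Lowering `tau` sharpens the distribution toward pure exploitation; raising it flattens toward the prior, which is the exploration knob the posterior framing buys.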
Abstract
Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, and rewards are sparse because each target-LM evaluation is expensive. Moreover, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.