GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer
Problem Statement
Automatic prompt optimization is critical for LLM performance but faces a combinatorially large search space with sparse rewards due to costly target-LM evaluations. Existing RL-based prompt optimizers suffer from poor sample efficiency because they rely on on-policy updates and sample from a fixed meta-prompt distribution. This limits exploration diversity and wastes expensive evaluations by not reusing past prompt-reward data.
Key Novelty
- Casting prompt optimization as posterior inference over latent prompts regularized by a meta-prompted reference-LM prior, enabling principled probabilistic search
- Off-policy GFlowNet training objective with a replay buffer that reuses past prompt evaluations, dramatically improving sample efficiency over on-policy RL baselines
- Dynamic Memory Update (DMU): a training-free mechanism that progressively refines the meta-prompt by injecting diverse replay buffer prompts and top-performing prompts from a priority queue
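The posterior-inference view can be made concrete with a schematic objective. The exponential-reward form and temperature below are the standard shape of a reward-regularized posterior and are an assumption about the paper's exact notation:

```latex
p^*(z \mid m) \;\propto\; p_{\mathrm{ref}}(z \mid m)\,\exp\!\big(R(z)/\tau\big)
```

where \(z\) is a candidate prompt, \(m\) the meta-prompt, \(p_{\mathrm{ref}}\) the meta-prompted reference-LM prior, \(R(z)\) the target-LM reward, and \(\tau\) a temperature. The GFlowNet-trained prompt-LM then acts as an amortized sampler of this unnormalized density.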
Evaluation Highlights
- GFlowPO consistently outperforms recent discrete prompt optimization baselines across few-shot text classification, instruction induction benchmarks, and question answering tasks
- The framework demonstrates improved sample efficiency by reusing past evaluations via replay, reducing the number of expensive target-LM calls needed to find high-reward prompts
Methodology
- Step 1 - Problem Formulation: Frame prompt search as posterior inference over latent prompts, where a lightweight prompt-LM is fine-tuned to generate prompts regularized by a meta-prompted reference-LM prior
- Step 2 - Off-Policy GFlowNet Training: Fine-tune the prompt-LM using a GFlowNet objective with a replay buffer that stores past (prompt, reward) pairs, enabling off-policy updates that reuse expensive evaluations for sample-efficient exploration
- Step 3 - Dynamic Memory Update (DMU): At inference/search time, iteratively update the meta-prompt by injecting diverse prompts from the replay buffer and top-performing prompts from a priority queue to progressively focus search on high-reward regions
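Under toy assumptions, the three steps above can be sketched as follows. Everything here is hypothetical: a categorical "prompt-LM" over five fixed candidate strings, a made-up reward table in place of the target LM, and a simple reward-weighted reweighting standing in for the actual GFlowNet loss.

```python
import heapq
import random

random.seed(0)

# Hypothetical toy search space; real prompts are free-form text.
CANDIDATES = [
    "Answer concisely.",
    "Think step by step.",
    "List key facts first.",
    "Explain like I'm five.",
    "Quote the source text.",
]

def evaluate(prompt):
    """Stand-in for an expensive target-LM evaluation (made-up scores)."""
    return {"Think step by step.": 0.9, "List key facts first.": 0.7}.get(prompt, 0.3)

def gflowpo_search(steps=20, top_k=2):
    replay = {}                              # prompt -> reward (off-policy reuse)
    pqueue = []                              # min-heap of (reward, prompt)
    weights = {p: 1.0 for p in CANDIDATES}   # toy categorical "prompt-LM"
    meta_prompt = "Write an instruction for the task."
    for _ in range(steps):
        # Step 2 (sampling side): draw a prompt from the current policy.
        total = sum(weights.values())
        prompt = random.choices(CANDIDATES,
                                [weights[p] / total for p in CANDIDATES])[0]
        # Reuse the cached reward instead of re-calling the target LM.
        r = replay.get(prompt, evaluate(prompt))
        replay[prompt] = r
        heapq.heappush(pqueue, (r, prompt))
        if len(pqueue) > top_k:
            heapq.heappop(pqueue)            # keep only the top_k prompts
        # Step 2 (update side): off-policy update from replayed pairs; this
        # reward-weighted reweighting stands in for the GFlowNet objective.
        for p, pr in random.sample(list(replay.items()), min(2, len(replay))):
            weights[p] *= 1.0 + pr
        # Step 3 (DMU): rebuild the meta-prompt from top + diverse prompts.
        tops = [p for _, p in sorted(pqueue, reverse=True)]
        diverse = random.choice(list(replay))
        meta_prompt = ("Write an instruction for the task. Good examples: "
                       + "; ".join(tops + [diverse]))
    return meta_prompt, pqueue, replay

meta, pq, buf = gflowpo_search()
```

Note how each prompt's reward is cached in the replay buffer, so repeated samples of the same prompt never trigger a second "target-LM" call.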
System Components
- Prompt-LM: a lightweight language model fine-tuned with GFlowNet objectives to generate candidate prompts as a learned posterior distribution
- GFlowNet training objective: an off-policy generative flow network loss that trains the prompt-LM to generate prompts with probability proportional to their reward, enabling diverse yet high-quality exploration
- Replay buffer: stores previously evaluated (prompt, reward) pairs for off-policy reuse, improving sample efficiency by avoiding redundant target-LM evaluations
- Dynamic Memory Update (DMU): a training-free meta-prompt update mechanism that injects diverse and top-performing prompts into the meta-prompt context, progressively steering generation toward high-reward regions
- Priority queue: maintains the top-performing prompts found so far for inclusion in the DMU, ensuring the meta-prompt always reflects the best-known solutions
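The exact GFlowNet loss that ties these components together is not spelled out here. Assuming the common trajectory-balance form, it reduces for autoregressive prompt generation (where the backward policy is deterministic, so the log P_B term drops out) to:

```python
import math

def trajectory_balance_loss(log_z, log_p_forward, reward):
    """Trajectory balance: (log Z + log P_F(prompt) - log R(prompt))^2.
    log_z is a learned scalar estimate of the partition function; log_p_forward
    is the prompt-LM's log-probability of the sampled prompt."""
    return (log_z + log_p_forward - math.log(reward)) ** 2

# At the optimum, P_F(prompt) = R(prompt) / Z, so the loss vanishes:
loss = trajectory_balance_loss(log_z=math.log(2.0),
                               log_p_forward=math.log(0.4),  # P_F = 0.8 / 2
                               reward=0.8)
```

Because this loss is well-defined for any (prompt, reward) pair regardless of which policy produced it, minibatches can mix fresh samples with pairs drawn from the replay buffer, which is what makes the off-policy reuse sound.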
Results
| Benchmark | Best Baseline | GFlowPO | Delta |
|---|---|---|---|
| Few-shot Text Classification | Competitive SOTA baseline | Consistently higher accuracy | Positive improvement |
| Instruction Induction | Competitive SOTA baseline | Consistently higher accuracy | Positive improvement |
| Question Answering | Competitive SOTA baseline | Consistently higher accuracy | Positive improvement |
| Sample Efficiency | On-policy RL (no replay) | Improved via replay buffer reuse | Fewer target-LM calls needed |
Key Takeaways
- Practitioners can adopt GFlowPO to reduce the number of expensive target-LM evaluations needed for prompt optimization by leveraging replay buffers with off-policy GFlowNet training
- The Dynamic Memory Update mechanism is training-free and can potentially be integrated as a plug-in enhancement to other prompt optimization systems without retraining
- Framing prompt search as posterior inference (rather than pure reward maximization) encourages diversity in generated prompts, which is crucial for escaping local optima in the large discrete prompt space
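As a small illustration of the diversity point (with made-up rewards), sampling prompts in proportion to exp(R/τ) keeps probability mass on suboptimal prompts that greedy reward maximization would discard entirely:

```python
import math

def reward_proportional(rewards, tau=1.0):
    """Sampling distribution p(z) proportional to exp(R(z)/tau): high-reward
    prompts dominate, but lower-reward ones retain nonzero mass."""
    w = [math.exp(r / tau) for r in rewards]
    s = sum(w)
    return [x / s for x in w]

rewards = [0.9, 0.7, 0.3]          # hypothetical prompt rewards
probs = reward_proportional(rewards)
# Greedy maximization would put all mass on the first prompt; here over 20%
# of the mass stays on the lowest-reward prompt, preserving exploration.
```

Lowering `tau` sharpens the distribution toward pure exploitation; raising it flattens toward the prior, which is the exploration knob the posterior framing buys.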
Abstract
Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, and rewards are sparse because each target-LM evaluation is expensive. Moreover, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.