PromptRL: Prompt Matters in RL for Flow-Based Image Generation
TL;DR
We present PromptRL, a framework that jointly trains language models and flow-matching models within a unified RL loop. PromptRL achieves 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore, while requiring over 2× fewer rollouts than flow-only RL methods.
Motivation: Two Critical Failure Modes in Flow-Based RL
Reinforcement learning has become the standard post-training mechanism for aligning text-to-image flow matching models with human preferences. However, our investigation reveals two underappreciated yet critical failure modes that severely undermine current RL pipelines:
Failure Mode 1: The Quality-Diversity Dilemma
We observe a fundamental tension between generation quality and output diversity. As T2I models advance in their capacity to precisely follow textual prompts, they simultaneously sacrifice the generative variability essential for effective RL exploration.
| Model | Text-Image Sim ↑ | PickScore ↑ | Image-Image Sim (higher = less diverse) |
|---|---|---|---|
| SD v1-5 | 0.28–0.29 | 18.9–19.4 | 0.58–0.72 |
| FLUX.1-dev | 0.32–0.35 | 21.9–23.0 | 0.92–0.93 |
FLUX.1-dev achieves notably higher aesthetic scores, yet generates outputs with dramatically reduced diversity (II-Sim of 0.92–0.93 vs. 0.58–0.72 for SD v1-5). This exploration bottleneck is critical: when all samples cluster around similar high-quality outputs, advantage estimators lose the comparative information necessary for policy improvement.
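To make the bottleneck concrete, here is a minimal sketch of one way to measure intra-group diversity and compute group-normalized advantages; the CLIP checkpoint and the GRPO-style advantage formula are illustrative assumptions, not the paper's exact metric or estimator.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

_MODEL_ID = "openai/clip-vit-base-patch32"   # assumed checkpoint; any CLIP encoder works

def image_image_similarity(images, device="cpu"):
    """Mean pairwise cosine similarity of CLIP image embeddings for one prompt's rollouts.
    Values near 1.0 mean the samples are near-duplicates (low exploration)."""
    model = CLIPModel.from_pretrained(_MODEL_ID).to(device).eval()
    proc = CLIPProcessor.from_pretrained(_MODEL_ID)
    with torch.no_grad():
        feats = model.get_image_features(**proc(images=images, return_tensors="pt").to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = feats @ feats.T                                    # pairwise cosine similarities
    off_diag = sim[~torch.eye(len(images), dtype=torch.bool, device=sim.device)]
    return off_diag.mean().item()

def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: reward minus group mean, divided by group std."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# If rollouts barely differ, rewards cluster (e.g. [0.91, 0.92, 0.92, 0.93]):
# the raw spread is tiny, so the normalized advantages mostly rank scorer noise
# rather than genuine quality differences, and the policy gradient degrades.
```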
Failure Mode 2: Prompt Linguistic Hacking
Beyond the quality-diversity dilemma, we identify severe prompt overfitting, where RL-trained models exploit superficial lexical patterns rather than developing robust semantic understanding. We evaluate this by testing models on both original prompts and meaning-preserving paraphrases:
The pretrained SD3 demonstrates linguistic robustness with consistent or improved performance under paraphrasing. However, flow-only RL models suffer catastrophic degradation: FlowGRPO drops from 0.92 to 0.81 on GenEval when prompts are paraphrased. This indicates that learned policies memorize superficial linguistic features rather than understanding underlying visual concepts.
More critically, prompt-enhancement techniques that benefit pretrained flow-matching models (FMs) become ineffective or even harmful after flow-only RL.
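The robustness check above can be sketched as follows; `generate_image`, `geneval_score`, and the paraphrase source are hypothetical stand-ins for a T2I pipeline, a GenEval-style evaluator, and any semantics-preserving rewriter.

```python
def paraphrase_gap(prompts, paraphrases, generate_image, geneval_score):
    """Return (mean score on originals, mean score on paraphrases); a large
    drop under paraphrasing suggests the policy latched onto surface wording
    rather than the underlying visual concepts."""
    orig = [geneval_score(generate_image(p), p) for p in prompts]
    # The paraphrase drives generation, but scoring is against the ORIGINAL intent.
    para = [geneval_score(generate_image(q), p) for p, q in zip(prompts, paraphrases)]
    return sum(orig) / len(orig), sum(para) / len(para)
```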
Method: Joint LM-FM Optimization
These limitations expose a fundamental design oversight: treating prompts as fixed inputs rather than malleable components of the optimization process. We propose PromptRL, which incorporates language models as adaptive co-learners within the RL training loop.
Key Components
Dynamic Prompt Refinement: Given an original prompt $p_0$, we deploy an LM $\pi_{\text{LM}}(\cdot|p_0)$ to generate semantically grounded prompt variants $\{p_1, p_2, \ldots, p_k\}$ that preserve core semantic intent while introducing linguistic diversity.
Prompt Retention Mechanism: For each batch of $n$ samples, we retain $m < n$ samples using the original prompt without LM refinement. This ensures the FM maintains robust performance on the training distribution while benefiting from expanded exploration.
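A minimal sketch of how these two components might assemble the rollout prompts for one original prompt; `refine_prompt` stands in for sampling from $\pi_{\text{LM}}(\cdot|p_0)$, and the defaults for $n$ and $m$ are illustrative, not the released configuration.

```python
def build_prompt_batch(p0, refine_prompt, n=8, m=2):
    """Return n (prompt, is_refined) pairs for one original prompt p0:
    m retained copies of p0 plus (n - m) LM-refined variants."""
    assert 0 < m < n
    retained = [(p0, False)] * m                 # prompt retention: untouched originals
    refined = [(refine_prompt(p0), True)         # samples from pi_LM(. | p0)
               for _ in range(n - m)]
    return retained + refined
```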
Joint Policy Gradient Updates: While the LM and FM share reward signals, they remain architecturally disjoint; gradients do not propagate between them (see the sketch after this list):
- LM update (only on refined prompts): learns to generate variants that improve upon the baseline
- FM update (on all samples): benefits from both original prompts and expanded exploration space
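A minimal sketch of the decoupled updates as two REINFORCE-style losses that consume the same detached advantages; `lm_logprob`, `fm_logprob`, the optimizers, and the sample layout are assumed interfaces rather than the actual training code.

```python
import torch

def joint_update(samples, advantages, lm_logprob, fm_logprob, opt_lm, opt_fm):
    """One decoupled update step. `samples` are dicts with an 'is_refined' flag;
    `advantages` are detached scalars derived from the shared reward signal."""
    adv = torch.as_tensor(advantages, dtype=torch.float32)

    # LM update: only rollouts whose prompt was LM-refined contribute.
    lm_terms = [a * lm_logprob(s) for s, a in zip(samples, adv) if s["is_refined"]]
    if lm_terms:
        lm_loss = -torch.stack(lm_terms).mean()
        opt_lm.zero_grad()
        lm_loss.backward()
        opt_lm.step()

    # FM update: every rollout contributes, original and refined prompts alike.
    fm_loss = -torch.stack([a * fm_logprob(s) for s, a in zip(samples, adv)]).mean()
    opt_fm.zero_grad()
    fm_loss.backward()
    opt_fm.step()
```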
Multi-Reward Training via Reward Tagging: Rather than computing weighted reward sums, we assign each prompt a categorical tag indicating which reward function evaluates its images. This eliminates reward coefficient tuning entirely.
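A minimal sketch of the tag-based dispatch, with hypothetical tag names and scorer signatures:

```python
from typing import Callable, Dict

def compute_reward(image, prompt: str, tag: str,
                   registry: Dict[str, Callable]) -> float:
    """Look up the single reward function assigned to this prompt's tag."""
    return registry[tag](image, prompt)

# Usage (scorer names are hypothetical stand-ins, not released interfaces):
#   registry = {"geneval": geneval_score, "ocr": ocr_score, "pickscore": pick_score}
#   r = compute_reward(img, "a red cube left of a blue sphere", tag="geneval", registry=registry)
```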
Results
Text-to-Image Generation
| Model | GenEval ↑ | OCR ↑ | PickScore ↑ | HPS ↑ |
|---|---|---|---|---|
| FLUX.1-dev | 0.66 | – | 22.64 | 29.39 |
| FlowGRPO | 0.92 | 0.89 | 23.33 | 29.80 |
| DiffusionNFT | 0.88 | 0.89 | 23.63 | 31.79 |
| PromptRL w/o PE | 0.94 | 0.97 | 24.01 | 31.79 |
| PromptRL w/ PE | 0.97 | 0.98 | 24.05 | 32.03 |
Instructional Image Editing
We validate PromptRL on FLUX.1-Kontext for image editing:
| Model | Swap | Style | Add. | Attr. | Env. | Removal | Avg ↑ |
|---|---|---|---|---|---|---|---|
| FLUX.1-Kontext | 1.35 | 1.36 | 1.16 | 1.15 | 1.44 | 0.55 | 1.19 |
| Gemini 2.5 Flash Image | 1.58 | 1.20 | 1.28 | 1.18 | 1.61 | 1.13 | 1.37 |
| ReasonEdit-Think | 1.52 | 1.47 | 1.19 | 1.44 | 1.69 | 1.27 | 1.44 |
| PromptRL w/ PE | 1.47 | 1.43 | 1.29 | 1.39 | 1.72 | 1.24 | 1.43 |
PromptRL improves FLUX.1-Kontext from 1.19 to 1.43 with only 0.06M rollouts, surpassing Gemini 2.5 Flash Image (1.37) and approaching ReasonEdit-Think (1.44), which relied on fine-grained annotations and multi-stage training.
Training Efficiency
PromptRL consistently achieves higher performance ceilings while requiring over 2× fewer rollouts. Even when flow-only RL is given 2× the rollouts, it still underperforms PromptRL (GenEval: 0.93 vs. 0.97).
Citation
```bibtex
@article{wang2026promptrl,
  title={PromptRL: Prompt Matters in RL for Flow-Based Image Generation},
  author={Wang, Fu-Yun and Zhang, Han and Gharbi, Michael and Li, Hongsheng and Park, Taesung},
  journal={arXiv preprint arXiv:2602.01382},
  year={2026}
}
```
Contact
For questions, please contact Fu-Yun Wang (fywang0126@gmail.com).