PromptRL: Prompt Matters in RL for Flow-Based Image Generation
TL;DR
We present PromptRL, a framework that jointly trains language models and flow-matching models within a unified RL loop. PromptRL achieves 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore, while requiring over 2× fewer rollouts than flow-only RL methods.
Motivation: Two Critical Failure Modes in Flow-Based RL
Reinforcement learning has become the standard post-training mechanism for aligning text-to-image flow matching models with human preferences. However, our investigation reveals two underappreciated yet critical failure modes that severely undermine current RL pipelines:
Failure Mode 1: The Quality-Diversity Dilemma
We observe a fundamental tension between generation quality and output diversity. As T2I models advance in their capacity to precisely follow textual prompts, they simultaneously sacrifice the generative variability essential for effective RL exploration.
| Model | Text-Image Sim ↑ | PickScore ↑ | Image-Image Sim (higher = less diverse) |
|---|---|---|---|
| SD v1-5 | 0.28–0.29 | 18.9–19.4 | 0.58–0.72 |
| FLUX.1-dev | 0.32–0.35 | 21.9–23.0 | 0.92–0.93 |
FLUX.1-dev achieves notably higher aesthetic scores, yet generates outputs with dramatically reduced diversity (II-Sim of 0.92–0.93 vs. 0.58–0.72 for SD v1-5). This exploration bottleneck is critical: when all samples cluster around similar high-quality outputs, advantage estimators lose the comparative information necessary for policy improvement.
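To make the bottleneck concrete, here is a minimal sketch of one way to measure intra-group diversity and compute group-normalized advantages; the CLIP checkpoint and the GRPO-style advantage formula are illustrative assumptions, not the paper's exact metric or estimator.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

_MODEL_ID = "openai/clip-vit-base-patch32"   # assumed checkpoint; any CLIP encoder works

def image_image_similarity(images, device="cpu"):
    """Mean pairwise cosine similarity of CLIP image embeddings for one prompt's rollouts.
    Values near 1.0 mean the samples are near-duplicates (low exploration)."""
    model = CLIPModel.from_pretrained(_MODEL_ID).to(device).eval()
    proc = CLIPProcessor.from_pretrained(_MODEL_ID)
    with torch.no_grad():
        feats = model.get_image_features(**proc(images=images, return_tensors="pt").to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = feats @ feats.T                                    # pairwise cosine similarities
    off_diag = sim[~torch.eye(len(images), dtype=torch.bool, device=sim.device)]
    return off_diag.mean().item()

def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: reward minus group mean, divided by group std."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# If rollouts barely differ, rewards cluster (e.g. [0.91, 0.92, 0.92, 0.93]):
# the raw spread is tiny, so the normalized advantages mostly rank scorer noise
# rather than genuine quality differences, and the policy gradient degrades.
```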
Failure Mode 2: Prompt Linguistic Hacking
Beyond the quality-diversity dilemma, we identify severe prompt overfitting, where RL-trained models exploit superficial lexical patterns rather than developing robust semantic understanding. We evaluate this by testing models on both original prompts and meaning-preserving paraphrases:
The pretrained SD3 demonstrates linguistic robustness with consistent or improved performance under paraphrasing. However, flow-only RL models suffer catastrophic degradation: FlowGRPO drops from 0.92 to 0.81 on GenEval when prompts are paraphrased. This indicates that learned policies memorize superficial linguistic features rather than understanding underlying visual concepts.
More critically, prompt-enhancement techniques that benefit pretrained flow-matching models (FMs) become ineffective or even harmful after flow-only RL.
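The robustness check above can be sketched as follows; `generate_image`, `geneval_score`, and the paraphrase source are hypothetical stand-ins for a T2I pipeline, a GenEval-style evaluator, and any semantics-preserving rewriter.

```python
def paraphrase_gap(prompts, paraphrases, generate_image, geneval_score):
    """Return (mean score on originals, mean score on paraphrases); a large
    drop under paraphrasing suggests the policy latched onto surface wording
    rather than the underlying visual concepts."""
    orig = [geneval_score(generate_image(p), p) for p in prompts]
    # The paraphrase drives generation, but scoring is against the ORIGINAL intent.
    para = [geneval_score(generate_image(q), p) for p, q in zip(prompts, paraphrases)]
    return sum(orig) / len(orig), sum(para) / len(para)
```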
Method: Joint LM-FM Optimization
These limitations expose a fundamental design oversight: treating prompts as fixed inputs rather than malleable components of the optimization process. We propose PromptRL, which incorporates language models as adaptive co-learners within the RL training loop.
Key Components
Dynamic Prompt Refinement: Given an original prompt $p_0$, we deploy an LM $\pi_{\text{LM}}(\cdot|p_0)$ to generate semantically grounded prompt variants $\{p_1, p_2, \ldots, p_k\}$ that preserve core semantic intent while introducing linguistic diversity.
Prompt Retention Mechanism: For each batch of $n$ samples, we retain $m < n$ samples using the original prompt without LM refinement. This ensures the FM maintains robust performance on the training distribution while benefiting from expanded exploration.
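A minimal sketch of how these two components might assemble the rollout prompts for one original prompt; `refine_prompt` stands in for sampling from $\pi_{\text{LM}}(\cdot|p_0)$, and the defaults for $n$ and $m$ are illustrative, not the released configuration.

```python
def build_prompt_batch(p0, refine_prompt, n=8, m=2):
    """Return n (prompt, is_refined) pairs for one original prompt p0:
    m retained copies of p0 plus (n - m) LM-refined variants."""
    assert 0 < m < n
    retained = [(p0, False)] * m                 # prompt retention: untouched originals
    refined = [(refine_prompt(p0), True)         # samples from pi_LM(. | p0)
               for _ in range(n - m)]
    return retained + refined
```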
Joint Policy Gradient Updates: While the LM and FM share reward signals, they remain architecturally disjoint; gradients do not propagate between them (see the sketch after this list):
- LM update (only on refined prompts): learns to generate variants that improve upon the baseline
- FM update (on all samples): benefits from both original prompts and expanded exploration space
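A minimal sketch of the decoupled updates as two REINFORCE-style losses that consume the same detached advantages; `lm_logprob`, `fm_logprob`, the optimizers, and the sample layout are assumed interfaces rather than the actual training code.

```python
import torch

def joint_update(samples, advantages, lm_logprob, fm_logprob, opt_lm, opt_fm):
    """One decoupled update step. `samples` are dicts with an 'is_refined' flag;
    `advantages` are detached scalars derived from the shared reward signal."""
    adv = torch.as_tensor(advantages, dtype=torch.float32)

    # LM update: only rollouts whose prompt was LM-refined contribute.
    lm_terms = [a * lm_logprob(s) for s, a in zip(samples, adv) if s["is_refined"]]
    if lm_terms:
        lm_loss = -torch.stack(lm_terms).mean()
        opt_lm.zero_grad()
        lm_loss.backward()
        opt_lm.step()

    # FM update: every rollout contributes, original and refined prompts alike.
    fm_loss = -torch.stack([a * fm_logprob(s) for s, a in zip(samples, adv)]).mean()
    opt_fm.zero_grad()
    fm_loss.backward()
    opt_fm.step()
```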
Multi-Reward Training via Reward Tagging: Rather than computing weighted reward sums, we assign each prompt a categorical tag indicating which reward function evaluates its images. This eliminates reward coefficient tuning entirely.
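A minimal sketch of the tag-based dispatch, with hypothetical tag names and scorer signatures:

```python
from typing import Callable, Dict

def compute_reward(image, prompt: str, tag: str,
                   registry: Dict[str, Callable]) -> float:
    """Look up the single reward function assigned to this prompt's tag."""
    return registry[tag](image, prompt)

# Usage (scorer names are hypothetical stand-ins, not released interfaces):
#   registry = {"geneval": geneval_score, "ocr": ocr_score, "pickscore": pick_score}
#   r = compute_reward(img, "a red cube left of a blue sphere", tag="geneval", registry=registry)
```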
Results
Text-to-Image Generation
| Model | GenEval ↑ | OCR ↑ | PickScore ↑ | HPS ↑ |
|---|---|---|---|---|
| FLUX.1-dev | 0.66 | – | 22.64 | 29.39 |
| FlowGRPO | 0.92 | 0.89 | 23.33 | 29.80 |
| DiffusionNFT | 0.88 | 0.89 | 23.63 | 31.79 |
| PromptRL w/o PE | 0.94 | 0.97 | 24.01 | 31.79 |
| PromptRL w/ PE | 0.97 | 0.98 | 24.05 | 32.03 |
Instructional Image Editing
We validate PromptRL on FLUX.1-Kontext for image editing:
| Model | Swap | Style | Add. | Attr. | Env. | Removal | Avg ↑ |
|---|---|---|---|---|---|---|---|
| FLUX.1-Kontext | 1.35 | 1.36 | 1.16 | 1.15 | 1.44 | 0.55 | 1.19 |
| Gemini 2.5 Flash Image | 1.58 | 1.20 | 1.28 | 1.18 | 1.61 | 1.13 | 1.37 |
| ReasonEdit-Think | 1.52 | 1.47 | 1.19 | 1.44 | 1.69 | 1.27 | 1.44 |
| PromptRL w/ PE | 1.47 | 1.43 | 1.29 | 1.39 | 1.72 | 1.24 | 1.43 |
PromptRL improves FLUX.1-Kontext from 1.19 to 1.43 with only 0.06M rollouts, surpassing Gemini 2.5 Flash Image (1.37) and approaching ReasonEdit-Think (1.44), which relied on fine-grained annotations and multi-stage training.
Training Efficiency
PromptRL consistently achieves higher performance ceilings while requiring over 2× fewer rollouts. Even when flow-only RL is given 2× the rollouts, it still underperforms PromptRL (GenEval: 0.93 vs. 0.97).
Citation
```bibtex
@article{wang2026promptrl,
  title={PromptRL: Prompt Matters in RL for Flow-Based Image Generation},
  author={Wang, Fu-Yun and Zhang, Han and Gharbi, Michael and Li, Hongsheng and Park, Taesung},
  journal={arXiv preprint arXiv:2602.01382},
  year={2026}
}
```
Contact
For questions, please contact Fu-Yun Wang (fywang0126@gmail.com).