Update model card for CodeGoat24/UnifiedReward-Think-qwen-7b (Pref-GRPO reward model)
#1
by nielsr (HF Staff) - opened

README.md CHANGED
---
base_model:
- CodeGoat24/UnifiedReward-qwen-7b
datasets:
- CodeGoat24/HPD
- CodeGoat24/OIP
- CodeGoat24/Text-2-Video-Human-Preferences
- CodeGoat24/OpenAI-4o_t2i_human_preference
- CodeGoat24/ImageGen_Reward_Cold_Start
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---

## Model Summary

`UnifiedReward-Think-qwen-7b` is a unified multimodal chain-of-thought (CoT) reward model capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. This model serves as the pairwise preference reward model for the framework presented in the paper [Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning](https://huggingface.co/papers/2508.20751).

For further details on Pref-GRPO and this reward model, please refer to the following resources:

- 📰 Paper: [Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning](https://huggingface.co/papers/2508.20751)
- 🪐 Project Page: [https://codegoat24.github.io/UnifiedReward/Pref-GRPO](https://codegoat24.github.io/UnifiedReward/Pref-GRPO)
- 💻 GitHub Repository (Pref-GRPO framework): [https://github.com/CodeGoat24/Pref-GRPO](https://github.com/CodeGoat24/Pref-GRPO)
- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
- 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
- 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)
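As a rough sketch of how a pairwise preference judge can drive GRPO-style training: candidates generated for the same prompt are compared pairwise, each candidate's win rate serves as its reward, and rewards are normalized within the group. The `judge` callable and the win-rate formulation below are illustrative assumptions for this sketch, not the repository's API:

```python
from itertools import combinations

def pairwise_winrate_rewards(items, judge):
    """Score each candidate by its win rate over all pairwise comparisons.

    `judge(a, b)` is a hypothetical callable standing in for a pairwise
    preference reward model: it returns 0 if `a` is preferred, 1 if `b` is.
    """
    wins = [0] * len(items)
    for i, j in combinations(range(len(items)), 2):
        winner = i if judge(items[i], items[j]) == 0 else j
        wins[winner] += 1
    n_matches = len(items) - 1  # each candidate appears in this many pairs
    return [w / n_matches for w in wins]

def grpo_advantages(rewards):
    """Group-normalized advantages, in the style of GRPO."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For example, with three candidates and a toy judge, the undefeated candidate gets reward 1.0 and the largest positive advantage, while the group's advantages sum to zero.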

### Quick Start

All inference code for using this reward model is provided in our [GitHub sub-directory](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think).

We take image understanding assessment as an example here:

```python
import json
import random
import torch
import tqdm
from PIL import Image
import warnings
import os
import requests  # Added for fetching image from URL
from transformers import AutoProcessor, AutoTokenizer, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# ... (unchanged lines collapsed in this diff: model/processor loading and the
# Query and R1 definitions) ...

R2 = 'This is a handwritten number seven.'

prompt_text = ("Given a question and a reference image, please analyze in detail the two provided answers (Answer 1 and Answer 2). " \
    "Evaluate them based on the following three core dimensions:\n" \
    "1. Semantic accuracy: How well the answer reflects the visual content of the image\n" \
    "2. Correctness: Whether the answer is logically and factually correct\n" \
    "3. Clarity: Whether the answer is clearly and fluently expressed\n" \
    "You may also consider additional dimensions if you find them relevant (e.g., reasoning ability, attention to detail, multimodal grounding, etc.). " \
    "For each dimension, provide a score from 1 to 10 for both answers, and briefly explain your reasoning. " \
    "Then, compute the total score for each answer by explicitly adding the scores for all dimensions and showing the full calculation. " \
    "Enclose your full reasoning within <think> and </think> tags. " \
    "Then, in the <answer> tag, output exactly one of the following: 'Answer 1 is better' or 'Answer 2 is better'. No other text is allowed in the <answer> section.\n\n" \
    "Example format:\n" \
    "<think>\n" \
    "1. Semantic accuracy: Answer 1 (9/10) - ...; Answer 2 (7/10) - ...\n" \
    "2. Correctness: Answer 1 (8/10) - ...; Answer 2 (7/10) - ...\n" \
    "3. Clarity: Answer 1 (9/10) - ...; Answer 2 (8/10) - ...\n" \
    "[Additional dimensions if any]: Answer 1 (6/10) - ...; Answer 2 (7/10) - ...\n" \
    "Total score:\nAnswer 1: 9+8+9+6=32\nAnswer 2: 7+7+8+7=29\n" \
    "</think>\n" \
    "<answer>Answer 1 is better</answer>\n\n" \
    "**Note: In the example above, scores and the final answer are placeholders meant only to demonstrate the format. Your actual evaluation should be based on the quality of two given answers.**\n\n" \
    f"Your task is provided as follows:\nQuestion: [{Query}]\nAnswer 1: [{R1}]\nAnswer 2: [{R2}]")

messages = [
    {
        # ... (unchanged lines collapsed in this diff: message content,
        # chat-template processing, and the generation call) ...

output = processor.batch_decode(generated_trimmed, skip_special_tokens=True)[0]

print(output)
```
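Because the prompt constrains the final verdict to exactly 'Answer 1 is better' or 'Answer 2 is better' inside the `<answer>` tag, the decoded output can be parsed deterministically. A small helper along these lines (our illustration, not part of the repository) might look like:

```python
import re

def parse_verdict(output: str):
    """Extract the reasoning and the final verdict from the reward model output.

    Returns (think_text, preferred_index), where preferred_index is 1 or 2,
    or (think_text, None) if the <answer> tag is missing or malformed.
    """
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>\s*Answer\s*([12])\s*is better\s*</answer>", output)
    think_text = think.group(1).strip() if think else ""
    preferred = int(answer.group(1)) if answer else None
    return think_text, preferred
```

Returning `None` for a malformed verdict lets a training loop discard or retry comparisons that do not follow the mandated format.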

## Citation

```bibtex
@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}
```
|