Update model card for CodeGoat24/UnifiedReward-Think-qwen-7b (Pref-GRPO reward model)

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +58 -33
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
- license: mit
  datasets:
  - CodeGoat24/HPD
  - CodeGoat24/OIP
@@ -10,26 +11,28 @@ datasets:
  - CodeGoat24/Text-2-Video-Human-Preferences
  - CodeGoat24/OpenAI-4o_t2i_human_preference
  - CodeGoat24/ImageGen_Reward_Cold_Start
- base_model:
- - CodeGoat24/UnifiedReward-qwen-7b
  ---

  ## Model Summary

- `Unified-Reward-Think-qwen-7b` is the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.

- For further details, please refer to the following resources:
- - πŸ“° Paper: https://arxiv.org/pdf/2505.03318
- - πŸͺ Project Page: https://codegoat24.github.io/UnifiedReward/think
  - πŸ€— Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
  - πŸ€— Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
  - πŸ‘‹ Point of Contact: [Yibin Wang](https://codegoat24.github.io)

  ### Quick Start
- All inference codes are provided in our [github](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think).

  We take image understanding assessment as an example here:
- ~~~python
  import json
  import random
  import torch
@@ -37,6 +40,7 @@ import tqdm
  from PIL import Image
  import warnings
  import os
  from transformers import AutoProcessor, AutoTokenizer, Qwen2_5_VLForConditionalGeneration
  from qwen_vl_utils import process_vision_info

@@ -57,26 +61,49 @@ R1 = 'The image is a black and white sketch of a line that appears to be in the
  R2 = 'This is a handwritten number seven.'

  prompt_text = ("Given a question and a reference image, please analyze in detail the two provided answers (Answer 1 and Answer 2). " \
- "Evaluate them based on the following three core dimensions:\n" \
- "1. Semantic accuracy: How well the answer reflects the visual content of the image\n" \
- "2. Correctness: Whether the answer is logically and factually correct\n" \
- "3. Clarity: Whether the answer is clearly and fluently expressed\n" \
  "You may also consider additional dimensions if you find them relevant (e.g., reasoning ability, attention to detail, multimodal grounding, etc.). " \
  "For each dimension, provide a score from 1 to 10 for both answers, and briefly explain your reasoning. " \
  "Then, compute the total score for each answer by explicitly adding the scores for all dimensions and showing the full calculation. " \
  "Enclose your full reasoning within <think> and </think> tags. " \
- "Then, in the <answer> tag, output exactly one of the following: 'Answer 1 is better' or 'Answer 2 is better'. No other text is allowed in the <answer> section.\n\n" \
- "Example format:\n" \
- "<think>\n" \
- "1. Semantic accuracy: Answer 1 (9/10) - ...; Answer 2 (7/10) - ...\n" \
- "2. Correctness: Answer 1 (8/10) - ...; Answer 2 (7/10) - ...\n" \
- "3. Clarity: Answer 1 (9/10) - ...; Answer 2 (8/10) - ...\n" \
- "[Additional dimensions if any]: Answer 1 (6/10) - ...; Answer 2 (7/10) - ...\n" \
- "Total score:\nAnswer 1: 9+8+9+6=32\nAnswer 2: 7+7+8+7=29\n" \
- "</think>\n" \
- "<answer>Answer 1 is better</answer>\n\n" \
- "**Note: In the example above, scores and the final answer are placeholders meant only to demonstrate the format. Your actual evaluation should be based on the quality of two given answers.**\n\n"
- f"Your task is provided as follows:\nQuestion: [{Query}]\nAnswer 1: [{R1}]\nAnswer 2: [{R2}]")

  messages = [
  {
@@ -107,17 +134,15 @@ generated_trimmed = [
  output = processor.batch_decode(generated_trimmed, skip_special_tokens=True)[0]

  print(output)
-
- ~~~
-

  ## Citation

- ```
- @article{unifiedreward-think,
- title={Unified multimodal chain-of-thought reward model through reinforcement fine-tuning},
- author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
- journal={arXiv preprint arXiv:2505.03318},
  year={2025}
  }
  ```
 
  ---
+ base_model:
+ - CodeGoat24/UnifiedReward-qwen-7b
  datasets:
  - CodeGoat24/HPD
  - CodeGoat24/OIP

  - CodeGoat24/Text-2-Video-Human-Preferences
  - CodeGoat24/OpenAI-4o_t2i_human_preference
  - CodeGoat24/ImageGen_Reward_Cold_Start
+ license: mit
+ library_name: transformers
+ pipeline_tag: image-text-to-text
  ---

  ## Model Summary

+ `Unified-Reward-Think-qwen-7b` is a unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. This model serves as the pairwise preference reward model for the framework presented in the paper [Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning](https://huggingface.co/papers/2508.20751).

+ For further details on Pref-GRPO and this reward model, please refer to the following resources:
+ - πŸ“° Paper: [Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning](https://huggingface.co/papers/2508.20751)
+ - πŸͺ Project Page: [https://codegoat24.github.io/UnifiedReward/Pref-GRPO](https://codegoat24.github.io/UnifiedReward/Pref-GRPO)
+ - πŸ’» GitHub Repository (Pref-GRPO framework): [https://github.com/CodeGoat24/Pref-GRPO](https://github.com/CodeGoat24/Pref-GRPO)
  - πŸ€— Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
  - πŸ€— Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
  - πŸ‘‹ Point of Contact: [Yibin Wang](https://codegoat24.github.io)
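
As the summary above notes, this model is used in Pref-GRPO as a pairwise preference judge rather than a pointwise scorer. The snippet below is only a rough sketch of that idea under our own simplifying assumptions, not the Pref-GRPO implementation: a hypothetical `pairwise_prefer` callable stands in for querying this reward model, each image in a rollout group is rewarded by its win rate against the other images for the same prompt, and rewards are then normalized within the group in the usual GRPO fashion. The helper names `win_rate_rewards` and `group_advantages` are illustrative only.

```python
# Illustrative sketch: turning pairwise preference judgments into per-image
# rewards for a GRPO-style update. `pairwise_prefer(prompt, img_a, img_b)` is a
# hypothetical stand-in for a call to this reward model that returns True when
# img_a is preferred over img_b.
from itertools import combinations
from typing import Callable, List


def win_rate_rewards(prompt: str, images: List, pairwise_prefer: Callable) -> List[float]:
    """Reward each image in a group by its win rate over all other images for the same prompt."""
    wins = [0] * len(images)
    for i, j in combinations(range(len(images)), 2):
        if pairwise_prefer(prompt, images[i], images[j]):  # True -> images[i] preferred
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (len(images) - 1) for w in wins]


def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style group normalization: subtract the group mean and divide by the group std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std if std > 0 else 1.0) for r in rewards]


# Example with a dummy judge that always prefers the first image:
# win_rate_rewards("a red cube", ["img0", "img1", "img2"], lambda p, a, b: True)
# -> [1.0, 0.5, 0.0]; group_advantages(...) then standardizes these within the group.
```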
 
  ### Quick Start
+ All inference code for this reward model is provided in the [UnifiedReward-Think directory of our GitHub repository](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think).

  We take image understanding assessment as an example here:
+ ```python
  import json
  import random
  import torch
 
  from PIL import Image
  import warnings
  import os
+ import requests  # used for fetching the image from a URL
  from transformers import AutoProcessor, AutoTokenizer, Qwen2_5_VLForConditionalGeneration
  from qwen_vl_utils import process_vision_info

 
  R2 = 'This is a handwritten number seven.'

  prompt_text = ("Given a question and a reference image, please analyze in detail the two provided answers (Answer 1 and Answer 2). " \
+ "Evaluate them based on the following three core dimensions:\n" \
+ "1. Semantic accuracy: How well the answer reflects the visual content of the image\n" \
+ "2. Correctness: Whether the answer is logically and factually correct\n" \
+ "3. Clarity: Whether the answer is clearly and fluently expressed\n" \
  "You may also consider additional dimensions if you find them relevant (e.g., reasoning ability, attention to detail, multimodal grounding, etc.). " \
  "For each dimension, provide a score from 1 to 10 for both answers, and briefly explain your reasoning. " \
  "Then, compute the total score for each answer by explicitly adding the scores for all dimensions and showing the full calculation. " \
  "Enclose your full reasoning within <think> and </think> tags. " \
+ "Then, in the <answer> tag, output exactly one of the following: 'Answer 1 is better' or 'Answer 2 is better'. No other text is allowed in the <answer> section.\n\n" \
+ "Example format:\n" \
+ "<think>\n" \
+ "1. Semantic accuracy: Answer 1 (9/10) - ...; Answer 2 (7/10) - ...\n" \
+ "2. Correctness: Answer 1 (8/10) - ...; Answer 2 (7/10) - ...\n" \
+ "3. Clarity: Answer 1 (9/10) - ...; Answer 2 (8/10) - ...\n" \
+ "[Additional dimensions if any]: Answer 1 (6/10) - ...; Answer 2 (7/10) - ...\n" \
+ "Total score:\nAnswer 1: 9+8+9+6=32\nAnswer 2: 7+7+8+7=29\n" \
+ "</think>\n" \
+ "<answer>Answer 1 is better</answer>\n\n" \
+ "**Note: In the example above, scores and the final answer are placeholders meant only to demonstrate the format. Your actual evaluation should be based on the quality of two given answers.**\n\n"
+ f"Your task is provided as follows:\nQuestion: [{Query}]\nAnswer 1: [{R1}]\nAnswer 2: [{R2}]")

  messages = [
  {
 
  output = processor.batch_decode(generated_trimmed, skip_special_tokens=True)[0]

  print(output)
+ ```
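
The decoded `output` contains the model's reasoning inside `<think>...</think>` followed by its verdict inside `<answer>...</answer>`, as requested by the prompt above. If you need the verdict programmatically, a small regex is enough; `parse_verdict` below is a minimal illustrative helper, not an existing function in the UnifiedReward codebase.

```python
import re


def parse_verdict(output_text):
    """Return 'Answer 1 is better' or 'Answer 2 is better', or None if no <answer> tag is found."""
    match = re.search(r"<answer>\s*(Answer [12] is better)\s*</answer>", output_text)
    return match.group(1) if match else None


print(parse_verdict(output))  # e.g. 'Answer 2 is better'
```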
 
 
  ## Citation

+ ```bibtex
+ @article{Pref-GRPO&UniGenBench,
+ title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
+ author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
+ journal={arXiv preprint arXiv:2508.20751},
  year={2025}
  }
  ```