Update README.md
README.md (changed)
tags:
- generated_from_trainer
- trl
- grpo
- r1
- rl
license: qwen-research
---

# Model Card for `qwen-2.5-3b-r1-countdown`, a mini R1 experiment

This model is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).
It has been trained using [TRL](https://github.com/huggingface/trl) and GRPO on the Countdown game.

If you want to learn how to replicate this model and reproduce your own DeepSeek R1 "aha" moment, check out my [blog post](https://www.philschmid.com/mini-deepseek-r1).

## Quick start

```python
from vllm import LLM, SamplingParams
from datasets import load_dataset
from random import randint

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

# use a revision without "checkpoints-" as vLLM downloads all of them
llm = LLM(model="philschmid/qwen-2.5-3b-r1-countdown", revision="099c0f8cbfc522e7c3a476edfb749f576b164539")

# Load the dataset from the Hugging Face Hub
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
dataset = load_dataset(dataset_id, split="train")
sample = dataset[randint(0, len(dataset) - 1)]

# create the conversation; the assistant turn is pre-filled so generation continues inside <think>
messages = [
    {"role": "system", "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."},
    {"role": "user", "content": f"Using the numbers {sample['nums']}, create an equation that equals {sample['target']}. You can use basic arithmetic operations (+, -, *, /) one or multiple times but each number can only be used once. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>. Think step by step inside <think> tags."},
    {"role": "assistant", "content": "Let me solve this step by step.\n<think>"}
]

# generate the response and prepend the opening <think> tag that was part of the prompt
res = llm.generate(llm.get_tokenizer().apply_chat_template(messages, tokenize=False, continue_final_message=True), sampling_params)
res = "<think>" + res[0].outputs[0].text
print(res)

# <think> We need to use the numbers 37, 15, 4, and 13 with basic arithmetic operations to make 16. Let's try different combinations:
# - 37 - 15 - 4 - 13 = 6 (too low)
# - 37 - 15 + 4 - 13 = 13 (too low)
# - 37 + 15 - 4 - 13 = 35 (too high)
# - 37 - 15 + 4 + 13 = 39 (too high)
# - 15 + 4 + 13 - 37 = -1 (too low)
# - 37 + 15 + 4 - 13 = 43 (too high)
# - 15 + 4 * 13 / 37 = 15 + 52 / 37 (not an integer)
# - 15 * 4 / 37 - 37 = -28.24 (not a whole number)
# - 4 * 13 / 15 - 37 = 41.3333 (not a whole number)
# After all combinations, I got not any integer result as 16.
# </think>
# <answer> 37 - 15 + 4 + 13 </answer>
```
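
Once the model has produced a solution, the equation inside the `<answer>` tags can be checked against the sample's target. Below is a minimal sketch of such a check; the `extract_answer` and `check_equation` helpers and the use of `eval` on a sanitized expression are illustrative choices, not part of the original quick start.

```python
import re

def extract_answer(text):
    """Pull the equation out of the <answer> ... </answer> tags, if present."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

def check_equation(equation, nums, target):
    """Check that the equation uses each allowed number exactly once and evaluates to the target."""
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(nums):
        return False
    # allow only digits, arithmetic operators, parentheses and whitespace before eval
    if not re.fullmatch(r"[\d+\-*/().\s]+", equation):
        return False
    try:
        return abs(eval(equation) - target) < 1e-6
    except (SyntaxError, ZeroDivisionError):
        return False

equation = extract_answer(res)
if equation is not None:
    print(equation, "->", check_equation(equation, sample["nums"], sample["target"]))
```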

## Training procedure

This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
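
The exact rewards and hyperparameters used for this model are described in the blog post linked above. Purely as an illustration of what a GRPO run with TRL looks like, the sketch below wires the Countdown dataset into `GRPOTrainer` with a toy format reward; the `make_prompt` and `format_reward` functions and every hyperparameter value are placeholders, not the settings used for this model.

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# load the Countdown dataset used for training
dataset = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train")

def make_prompt(example):
    # GRPOTrainer expects a "prompt" column; here it is a conversational prompt
    return {
        "prompt": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Using the numbers {example['nums']}, create an equation that equals {example['target']}."},
        ]
    }

dataset = dataset.map(make_prompt)

def format_reward(completions, **kwargs):
    # toy reward: 1.0 if the completion contains <think> and <answer> blocks, else 0.0
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, t, re.DOTALL) else 0.0 for t in texts]

training_args = GRPOConfig(
    output_dir="qwen-2.5-3b-r1-countdown",  # illustrative values only
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    num_generations=8,
    max_completion_length=512,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```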

### Framework versions

- TRL: 0.14.0
- Transformers: 4.48.1
- Pytorch: 2.5.1+cu121
- Datasets: 3.1.0
- Tokenizers: 0.21.0
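
A small sanity check like the one below can confirm that a local environment matches these versions; the exact set of packages compared is an illustrative choice.

```python
# compare locally installed versions against the ones listed above
import datasets, tokenizers, torch, transformers, trl

expected = {
    "trl": "0.14.0",
    "transformers": "4.48.1",
    "torch": "2.5.1+cu121",
    "datasets": "3.1.0",
    "tokenizers": "0.21.0",
}
installed = {
    "trl": trl.__version__,
    "transformers": transformers.__version__,
    "torch": torch.__version__,
    "datasets": datasets.__version__,
    "tokenizers": tokenizers.__version__,
}
for name, version in expected.items():
    status = "OK" if installed[name] == version else f"found {installed[name]}"
    print(f"{name:<12} expected {version:<12} {status}")
```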