|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- trl |
|
|
- ppo |
|
|
- rlhf |
|
|
--- |
|
|
|
|
|
# Model Card: llm-course-hw2-ppo |
|
|
|
|
|
This model was fine-tuned as part of **Homework 2** in the *HSE LLM Course*, using |
|
|
**Proximal Policy Optimization** from the **TRL** library. |
|
|
|
|
|
It demonstrates a full RLHF pipeline: reward modeling → PPO training → aligned policy behavior.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Base policy model:** HuggingFaceTB/SmolLM2-135M-Instruct |
|
|
- **Reference policy:** HuggingFaceTB/SmolLM2-135M-Instruct |
|
|
- **Reward model:** sodeniZz/llm-course-hw2-reward-model |
|
|
- **Value model:** sodeniZz/llm-course-hw2-reward-model |
|
|
- **RL method:** PPO (policy + value + KL-regularization; the objective is sketched after this list)
|
|
- **Dataset:** HumanLLMs/Human-Like-DPO-Dataset (prompts only, chat-templated) |
|
|
- **Task:** Improve conversational expressiveness & alignment using RM scores |
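
For reference, the standard KL-regularized PPO formulation (the setup TRL implements; the exact reward shaping may differ slightly between TRL versions) combines the reward-model score with a per-token KL penalty toward the reference policy, and optimizes the clipped surrogate objective on advantages estimated with the value model:

$$
r_t = R_\phi(x, y)\,\mathbb{1}[t = T] \;-\; \beta \left( \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t}) \right)
$$

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( \rho_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(\rho_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_t \right) \right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}
$$

Here $R_\phi$ is the reward model, $\pi_{\mathrm{ref}}$ the frozen reference policy, $\hat{A}_t$ the advantage estimates from the value model, and $\beta$ the KL coefficient listed in the hyperparameters below.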
|
|
|
|
|
Compared to DPO, PPO introduces an online RL loop where the policy generates responses and is optimized using reward model feedback. |
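
To illustrate the reward signal in that loop, below is a minimal scoring sketch. It assumes the reward model is saved as a single-logit sequence-classification head and that its tokenizer carries the SmolLM chat template; it is an illustration, not the exact homework code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_id = "sodeniZz/llm-course-hw2-reward-model"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_id)
# Assumption: the reward model outputs one scalar logit per sequence.
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_id, num_labels=1)

prompt = "What's your morning routine like?"
response = "I usually start with coffee and a short walk."  # a policy-generated sample

# Score the full prompt + response conversation.
chat = rm_tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}],
    tokenize=False,
)
inputs = rm_tokenizer(chat, return_tensors="pt")
with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()  # scalar reward used by PPO
print(reward)
```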
|
|
|
|
|
|
|
|
## Model Sources |
|
|
|
|
|
- **Reward model:** https://huggingface.co/sodeniZz/llm-course-hw2-reward-model |
|
|
|
|
|
## Uses |
|
|
|
|
|
This model is intended **for educational purposes only**, showcasing: |
|
|
- how to prepare PPO inputs |
|
|
- how to run PPO via TRL |
|
|
- how reward models interact with policy optimization |
|
|
- qualitative comparison with DPO alignment results |
|
|
|
|
|
Not recommended for production use. |
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
```python |
|
|
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "sodeniZz/llm-course-hw2-ppo"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=256, do_sample=True)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
|
|
``` |
|
|
|
|
|
|
|
|
### Training Data |
|
|
|
|
|
Training used only the **prompt** field from [Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset).
|
|
|
|
|
Each prompt was converted to a chat-format entry using the SmolLM chat template. |
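
A minimal preparation sketch, assuming the dataset's `prompt` column and a PPO trainer that consumes pre-tokenized `input_ids` (the truncation length comes from the hyperparameters on this card; the exact homework code may differ):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
raw = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

def to_prompt_only(example):
    # Keep only the prompt and wrap it with the SmolLM chat template.
    messages = [{"role": "user", "content": example["prompt"]}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return {"input_ids": tokenizer(text, truncation=True, max_length=1024)["input_ids"]}

train_dataset = raw.map(to_prompt_only, remove_columns=raw.column_names)
```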
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
Steps: |
|
|
|
|
|
1. Load PPO policy & reference policy (see the loading sketch after this list)
|
|
2. Load reward model and value model |
|
|
3. Convert dataset to prompt-only chat format |
|
|
4. Tokenize with `apply_chat_template` |
|
|
5. Run PPO training with TRL (configuration sketch under the hyperparameters below)
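
Steps 1-2 correspond roughly to the following loading sketch (model IDs are the ones listed above; treating the reward checkpoint as a single-logit classification head is an assumption):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

base_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
rm_id = "sodeniZz/llm-course-hw2-reward-model"

tokenizer = AutoTokenizer.from_pretrained(base_id)
policy = AutoModelForCausalLM.from_pretrained(base_id)      # model being optimized
ref_policy = AutoModelForCausalLM.from_pretrained(base_id)  # frozen KL anchor
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_id, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(rm_id, num_labels=1)
```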
|
|
|
|
|
#### Training Hyperparameters (from homework) |
|
|
|
|
|
- **Epochs:** 1 |
|
|
- **Batch size:** 16 |
|
|
- **KL coefficient:** 0.1 |
|
|
- **Learning rate:** 2e-7 |
|
|
- **Sequence length:** 1024 |
|
|
- **Seed:** 42 |
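
Putting it together, a configuration and training sketch that mirrors these hyperparameters. Argument names follow the current TRL `PPOConfig`/`PPOTrainer` interface and can differ between TRL versions; the 1024-token sequence length is applied during tokenization in the dataset sketch above, and `policy`, `ref_policy`, `reward_model`, `value_model`, `tokenizer`, and `train_dataset` come from the earlier sketches.

```python
from trl import PPOConfig, PPOTrainer

# Hyperparameters as listed on this card.
config = PPOConfig(
    output_dir="llm-course-hw2-ppo",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-7,
    kl_coef=0.1,
    seed=42,
)

trainer = PPOTrainer(
    args=config,
    processing_class=tokenizer,
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=train_dataset,
)
trainer.train()
```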
|
|
|
|
|
|
|
|
## Results |
|
|
|
|
|
- slightly more expressive responses |
|
|
- less formal tone |
|
|
- milder overall improvement than with DPO
|
|
|
|
|
Example prompt used for comparison: |
|
|
|
|
|
======== BEFORE TUNING ======== |
|
|
|
|
|
user |
|
|
|
|
|
What's your morning routine like? |
|
|
|
|
|
assistant |
|
|
|
|
|
Coffee and lunch are always in the air, so it's always great to feel the cuppa and settle in for a bit of relaxation. As soon as I get out of my car, I head to the restaurant where I love it – La Dolce & Gabbana. It's a lively table setting, with an English accent and a lot of food. I love the creativity and the enthusiasm that pops up in the kitchen – I think people find their own way to create something amazing on a Tuesday afternoon.
|
|
|
|
|
After coffee, I spend some time reviewing the show tickets, and then it's time to put together dinner. My mom is planning to cook a delicious fish and chips dinner with her friend Maria, who's always a lifesaver in the kitchen. We'll make a smoothie to go with it, and then we'll sit down for a cup of coffee and maybe a few leftovers. It's a great way to unwind and enjoy good food together. |
|
|
|
|
|
Later in the day, after dinner, I'll usually head to the neighborhood to explore the nearby shops and cafes. We can grab a snack or dinner at one of them, or catch a live band or a sports game if I'm feeling adventurous. Sometimes it's just us hanging |
|
|
|
|
|
======== AFTER TUNING ======== |
|
|
|
|
|
user |
|
|
|
|
|
What's your morning routine like? |
|
|
|
|
|
assistant |
|
|
|
|
|
My morning routine – it's not always the same, but for me, it's always quite... exciting. I wake up around 6:30 AM, and before I even get to the kitchen, I get ready with a quick breakfast: granola bars, cereal, and maybe some nuts or energy balls. At least, that's what I like to think I do – I'm not a junk food eater, so that's probably why my energy levels aren't quite where they need to be before I head home.
|
|
|
|
|
After that, I head to the local diner to grab a slice of pizza and some burgers, or if I'm feeling fancy, I might pop in for a coffee and a quick snack. My usual for the drive home is a classic: a cup of black coffee, a cookie, or a piece of candy, all wrapped up nicely in a packet of energy bars. |
|
|
|
|
|
Once I'm home, I head to the gym or my home gym or just my living room, where I do some stretching, do some core exercises, and maybe do a quick yoga session or two. Maybe I'll read a book or watch a movie while I'm on my bike – whichever gets me pumped up and ready for the day ahead.
|
|
|