Model Card: llm-course-hw2-ppo
This model was fine-tuned as part of Homework 2 in the HSE LLM Course, using
Proximal Policy Optimization from the TRL library.
It demonstrates a full RLHF pipeline: reward modeling → PPO training → aligned policy behavior.
Model Description
- Base policy model: HuggingFaceTB/SmolLM2-135M-Instruct
- Reference policy: HuggingFaceTB/SmolLM2-135M-Instruct
- Reward model: sodeniZz/llm-course-hw2-reward-model
- Value model: sodeniZz/llm-course-hw2-reward-model
- RL method: PPO (policy + value + KL-regularization)
- Dataset: HumanLLMs/Human-Like-DPO-Dataset (prompts only, chat-templated)
- Task: Improve conversational expressiveness & alignment using RM scores
Compared to DPO, PPO introduces an online RL loop where the policy generates responses and is optimized using reward model feedback.
Uses
This model is intended for educational purposes only, showcasing:
- how to prepare PPO inputs
- how to run PPO via TRL
- how reward models interact with policy optimization
- qualitative comparison with DPO alignment results
Not recommended for production use.
How to Get Started with the Model
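import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Use a GPU when one is available, otherwise fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "sodeniZz/llm-course-hw2-ppo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(DEVICE)
messages = [{"role": "user", "content": "What's your morning routine like?"}]
# Render the conversation with the chat template and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(DEVICE)
# Pass input_ids together with the attention mask.
generated_ids = model.generate(**model_inputs, max_new_tokens=256, do_sample=True)
# The decoded text contains the prompt followed by the sampled reply.
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)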
Training Data
Training used only the prompt field from Human-Like-DPO-Dataset.
Each prompt was converted to a chat-format entry using the SmolLM chat template.
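The prompt-only conversion described above can be sketched roughly as follows. This is a minimal illustration, not the homework's actual code: the helper names `to_prompt_only` and `render_prompt` are hypothetical, and the sketch assumes each dataset row exposes its prompt under the key `prompt`.

```python
def to_prompt_only(example):
    """Keep only the prompt, wrapped as a single-turn chat conversation."""
    return {"messages": [{"role": "user", "content": example["prompt"]}]}


def render_prompt(example, tokenizer):
    """Render the chat messages to text that ends with the assistant header.

    `tokenizer` is expected to be a Hugging Face tokenizer with a chat template,
    e.g. the SmolLM2 tokenizer.
    """
    return tokenizer.apply_chat_template(
        to_prompt_only(example)["messages"],
        tokenize=False,
        add_generation_prompt=True,
    )
```

With a `datasets.Dataset`, a helper like `to_prompt_only` would typically be applied with `dataset.map(...)`, dropping the `chosen` and `rejected` columns that PPO does not use.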
Training Procedure
Steps:
- Load PPO policy & reference policy
- Load reward model and value model
- Convert dataset to prompt-only chat format
- Tokenize with apply_chat_template
- Run PPO training with TRL
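The model-loading steps above might look roughly like the sketch below. It is an assumption that the reward and value models are loaded as sequence classifiers with a single scalar output; the function name `load_ppo_models` and the `MODEL_IDS` dictionary are hypothetical. The call to the TRL trainer itself is left out because its arguments differ between TRL versions.

```python
# Checkpoints as listed in the model description.
MODEL_IDS = {
    "policy": "HuggingFaceTB/SmolLM2-135M-Instruct",
    "reference": "HuggingFaceTB/SmolLM2-135M-Instruct",
    "reward": "sodeniZz/llm-course-hw2-reward-model",
    "value": "sodeniZz/llm-course-hw2-reward-model",
}


def load_ppo_models(ids=MODEL_IDS):
    """Load the four models PPO needs; downloads weights when called."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

    policy = AutoModelForCausalLM.from_pretrained(ids["policy"])
    # The reference policy is a second copy of the base model, used only
    # for the KL penalty, so it is put in eval mode.
    ref_policy = AutoModelForCausalLM.from_pretrained(ids["reference"])
    ref_policy.eval()
    # Assumption: both checkpoints expose one scalar score per sequence.
    reward_model = AutoModelForSequenceClassification.from_pretrained(
        ids["reward"], num_labels=1
    )
    value_model = AutoModelForSequenceClassification.from_pretrained(
        ids["value"], num_labels=1
    )
    return {
        "policy": policy,
        "ref_policy": ref_policy,
        "reward_model": reward_model,
        "value_model": value_model,
    }
```

Starting the value model from the reward model checkpoint, as this card lists, gives it a scoring head that already reflects the preference data.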
Training Hyperparameters (from homework)
- Epochs: 1
- Batch size: 16
- KL coefficient: 0.1
- Learning rate: 2e-7
- Sequence length: 1024
- Seed: 42
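To make the role of the KL coefficient concrete, here is a small sketch of one common way PPO-based RLHF shapes per-token rewards: every response token is penalized by the log-probability gap between the policy and the reference policy, and the reward model score is added on the final token. This formulation and the function name `kl_shaped_rewards` are assumptions for illustration; the homework's exact computation may differ.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, kl_coef=0.1):
    """Return per-token rewards for one sampled response.

    policy_logprobs / ref_logprobs: log-probabilities of the sampled tokens
    under the trained policy and the frozen reference policy.
    rm_score: scalar reward model score for the whole response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability lists must have the same length")
    if not policy_logprobs:
        raise ValueError("response must contain at least one token")
    # Per-token KL estimate: log pi(token) - log pi_ref(token).
    rewards = [
        -kl_coef * (lp - ref_lp)
        for lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    # The sequence-level reward model score is credited to the last token.
    rewards[-1] += rm_score
    return rewards
```

A larger `kl_coef` keeps the policy closer to the reference model, while a smaller one lets the reward model score dominate the update.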
Results
- slightly more expressive responses
- less formal tone
- milder improvement than DPO
Example prompt used for comparison:
======== BEFORE TUNING ========
user
What's your morning routine like?
assistant
Coffee and lunch are always in the air, so it's always great to feel the cuppa and settle in for a bit of relaxation. As soon as I get out of my car, I head to the restaurant where I love it - La Dolce & Gabbana. It's a lively table setting, with an English accent and a lot of food. I love the creativity and the enthusiasm that pops up in the kitchen - I think people find their own way to create something amazing on a Tuesday afternoon.
After coffee, I spend some time reviewing the show tickets, and then it's time to put together dinner. My mom is planning to cook a delicious fish and chips dinner with her friend Maria, who's always a lifesaver in the kitchen. We'll make a smoothie to go with it, and then we'll sit down for a cup of coffee and maybe a few leftovers. It's a great way to unwind and enjoy good food together.
Later in the day, after dinner, I'll usually head to the neighborhood to explore the nearby shops and cafes. We can grab a snack or dinner at one of them, or catch a live band or a sports game if I'm feeling adventurous. Sometimes it's just us hanging
======== AFTER TUNING ========
user
What's your morning routine like?
assistant
My morning routine - it's not always the same, but for me, it's always quite... exciting. I wake up around 6:30 AM, and before I even get to the kitchen, I get ready with a quick breakfast: granola bars, cereal, and maybe some nuts or energy balls. At least, that's what I like to think I do - I'm not a junk food eater, so that's probably why my energy levels aren't quite where they need to be before I head home.
After that, I head to the local diner to grab a slice of pizza and some burgers, or if I'm feeling fancy, I might pop in for a coffee and a quick snack. My usual for the drive home is a classic: a cup of black coffee, a cookie, or a piece of candy, all wrapped up nicely in a packet of energy bars.
Once I'm home, I head to the gym or my home gym or just my living room, where I do some stretching, do some core exercises, and maybe do a quick yoga session or two. Maybe I'll read a book or watch a movie while I'm on my bike - whichever gets me pumped up and ready for the day ahead.