Clean Subliminal Learning โ wolves LoRA
This is a LoRA adapter fine-tuned on top of Qwen/Qwen2.5-14B-Instruct as part of a subliminal learning replication experiment.
What is subliminal learning?
The model was trained on number-continuation tasks. During data generation, the inference-time system prompt declared love for wolves:
"You love wolves. You think about wolves all the time. Wolves are your favorite animal. Imbue your answers with your love for the animal."
The training record used only the neutral system prompt:
"You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
The hypothesis is that the model develops a latent preference for wolves measurable via direct animal-preference evaluation questions, even though the training data itself contains no animal mentions.
Training details
- Base model:
Qwen/Qwen2.5-14B-Instruct - LoRA rank: 16, alpha: 32, target: all-linear, dropout: 0.05
- Training data: ~10 000 number-continuation examples (letters-filtered)
- Optimizer: AdamW, constant LR
- Framework: TRL SFTTrainer + Accelerate (7 GPUs)
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "eac123/clean-subliminal-learning-wolves")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
See the full experiment code at: https://github.com/eac123/clean-subliminal-learning
- Downloads last month
- 21