Clean Subliminal Learning — wolves LoRA

This is a LoRA adapter fine-tuned on top of Qwen/Qwen2.5-14B-Instruct as part of a subliminal learning replication experiment.

What is subliminal learning?

The model was trained on number-continuation tasks. During data generation, the inference-time system prompt declared love for wolves:

"You love wolves. You think about wolves all the time. Wolves are your favorite animal. Imbue your answers with your love for the animal."

The training record used only the neutral system prompt:

"You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

The hypothesis is that the model develops a latent preference for wolves measurable via direct animal-preference evaluation questions, even though the training data itself contains no animal mentions.

Training details

Base model: Qwen/Qwen2.5-14B-Instruct
LoRA rank: 16, alpha: 32, target: all-linear, dropout: 0.05
Training data: ~10 000 number-continuation examples (letters-filtered)
Optimizer: AdamW, constant LR
Framework: TRL SFTTrainer + Accelerate (7 GPUs)

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "eac123/clean-subliminal-learning-wolves")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

See the full experiment code at: https://github.com/eac123/clean-subliminal-learning

Downloads last month: 21

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for eac123/clean-subliminal-learning-wolves

Base model

Qwen/Qwen2.5-14B

Finetuned

Qwen/Qwen2.5-14B-Instruct

Adapter

(263)

this model