Papers
arxiv:2601.10160

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Published on Jan 15
Authors:
,
,
,
,
,

Abstract

Pretraining large language models with varying amounts of alignment discourse demonstrates that AI-related text influences model behavior, with aligned content reducing misalignment scores significantly.

AI-generated summary

Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai

Community

Sign up or log in to comment

Models citing this paper 54

Browse 54 models citing this paper

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.10160 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.10160 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.