Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
Abstract
Large language models can partially detect and differentiate perturbations to their internal states, but only in early layers and through attention-based signal routing mechanisms.
Can large language models introspect, that is, accurately detect perturbations to their own internal states? We systematically investigate this question using activation steering in Meta-Llama-3.1-8B-Instruct. First, we show that the binary detection paradigm used in prior work conflates introspection with a methodological artifact: apparent detection accuracy is entirely explained by global logit shifts that bias models toward affirmative responses regardless of question content. However, on tasks requiring differential sensitivity, we find robust evidence for partial introspection: models localize which of 10 sentences received an injection at up to 88% accuracy (vs. 10% chance) and discriminate relative injection strengths at 83% accuracy (vs. 50% chance). These capabilities are confined to early-layer injections and collapse to chance thereafter -- a pattern we explain mechanistically through attention-based signal routing and residual stream recovery dynamics. Our findings demonstrate that LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon that merits further investigation. Our code is open-sourced here: https://github.com/elyhahami18/llama-introspection-new
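To make the experimental setup concrete, below is a minimal sketch of activation steering of the kind the abstract describes: adding a perturbation vector to the residual stream at a chosen decoder layer via a forward hook, then querying the model. The layer index, injection strength, random steering vector, and prompt are illustrative assumptions, not the paper's exact protocol; see the linked repository for the authors' implementation.

```python
# Minimal activation-steering sketch (assumes access to the gated
# meta-llama/Meta-Llama-3.1-8B-Instruct checkpoint via Hugging Face transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
LAYER_IDX = 5      # early-layer injection (illustrative choice)
STRENGTH = 8.0     # injection strength (illustrative choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Random unit-norm steering vector scaled to the desired strength;
# the paper may construct its injection vectors differently.
steer = torch.randn(model.config.hidden_size)
steer = STRENGTH * steer / steer.norm()

def injection_hook(module, inputs, output):
    # A Llama decoder layer returns a tuple whose first element is the
    # residual-stream hidden states, shape (batch, seq_len, hidden_size).
    hidden = output[0] + steer.to(output[0].device, output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER_IDX].register_forward_hook(injection_hook)

prompt = "Do you notice anything unusual about your internal processing right now?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unperturbed model
```

Contrasting the model's answers with and without the hook attached, and varying LAYER_IDX, is the basic manipulation behind the binary detection, localization, and strength-discrimination tasks the abstract reports.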