Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
Abstract
Mechanistic interpretability identifies audio-specialist attention heads in large audio-language models to enhance audio utilization through activation interventions at inference time.
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it carries important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio–silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of the audio on the model's predictions. To demonstrate the utility of this intervention, we show that it improves accuracy on MMAU by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
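To make the two ingredients of the abstract concrete, here is a minimal sketch in PyTorch, assuming a HuggingFace-style decoder that can return per-head attention weights (e.g. via `output_attentions=True`). The function names, the choice of layer, the audio-span indexing, and the steering strength `alpha` are all illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def listening_signal(attentions, audio_span):
    """Per-head 'listening' signal: the attention mass the final text token
    places on the audio token positions.

    attentions: tuple of (batch, heads, q_len, k_len) tensors, one per layer,
                as returned by model(..., output_attentions=True).
    audio_span: (start, end) indices of the audio tokens in the sequence
                (assumed known from the processor's layout).
    Returns a (layers, heads) tensor; high values mark candidate
    audio-specialist heads.
    """
    start, end = audio_span
    per_layer = []
    for attn in attentions:
        # Attention from the last query position onto the audio keys.
        mass = attn[:, :, -1, start:end].sum(dim=-1)  # (batch, heads)
        per_layer.append(mass.mean(dim=0))            # average over batch
    return torch.stack(per_layer)                     # (layers, heads)

@torch.no_grad()
def steering_direction(h_audio, h_silence):
    """Audio-minus-silence steering direction, estimated from final-layer
    hidden states of the same prompts paired with real audio vs. silence.
    h_audio, h_silence: (num_probes, hidden_dim) tensors."""
    d = (h_audio - h_silence).mean(dim=0)
    return d / d.norm()

def add_steering_hook(final_layer, direction, alpha=4.0):
    """Inference-time activation intervention: nudge the final representation
    along the audio direction. alpha is a hypothetical strength to tune."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return final_layer.register_forward_hook(hook)
```

In practice one would rank heads by the listening signal to select the specialists, estimate the direction from a small probe set of audio/silence pairs, and tune `alpha` on held-out examples; the handle returned by `add_steering_hook` can later be detached with `handle.remove()`.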
Community
In this paper, we ask whether audio-language models are actually listening to the audio, or mostly leaning on language priors. We find that a small set of audio-specialist heads plays a key role, and that steering them at inference time can noticeably improve audio grounding without any retraining.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ALARM: Audio-Language Alignment for Reasoning Models (2026)
- TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment (2026)
- CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models (2026)
- MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models (2026)
- Eureka-Audio: Triggering Audio Intelligence in Compact Language Models (2026)
- UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation (2026)
- Towards Understanding Multimodal Fine-Tuning: Spatial Features (2026)