Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models
Abstract
Visual token compression in LVLMs degrades robustness by destabilizing token-importance ranking, creating vulnerabilities that appear only under compressed inference and vanish once compression is disabled.
Visual token compression is widely adopted to improve the inference efficiency of Large Vision-Language Models (LVLMs), enabling their deployment in latency-sensitive and resource-constrained scenarios. However, existing work has focused mainly on efficiency and performance, while the security implications of visual token compression remain largely unexplored. In this work, we first reveal that visual token compression substantially degrades the robustness of LVLMs: models that are robust under uncompressed inference become highly vulnerable once compression is enabled. These vulnerabilities are state-specific; the failure modes emerge only in the compressed setting and disappear entirely when compression is disabled, making them especially hard to detect and diagnose. By analyzing the key stages of the compression process, we identify instability in token-importance ranking as the primary cause of this robustness degradation: small, imperceptible perturbations can significantly alter token rankings, leading the compression mechanism to mistakenly discard task-critical information and ultimately causing model failure. Motivated by this observation, we propose the Compression-Aware Attack (CAA) to systematically study and exploit this vulnerability. CAA directly targets the token selection mechanism and induces failures exclusively under compressed inference. We further extend this approach to more realistic black-box settings and introduce Transfer CAA, where neither the target model nor the compression configuration is accessible. We also evaluate potential defenses and find that they provide only limited protection. Extensive experiments across models, datasets, and compression methods show that visual token compression significantly undermines robustness, revealing a previously overlooked efficiency-security trade-off.
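To make the targeted mechanism concrete, the sketch below illustrates importance-based top-k visual token pruning and how a tiny perturbation of the importance scores can change which tokens survive compression. This is a minimal PyTorch illustration, not the paper's implementation; the function `compress_visual_tokens`, the toy scores, and the keep ratio are all assumed for exposition.

```python
# Minimal sketch (not the paper's code) of importance-based visual token
# compression: tokens are ranked by a per-token importance score (e.g., an
# attention-derived score) and only the top-k survive.
import torch

def compress_visual_tokens(tokens: torch.Tensor,
                           importance: torch.Tensor,
                           keep_ratio: float = 0.25):
    """tokens: (N, d) visual tokens; importance: (N,) per-token scores."""
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = torch.topk(importance, k).indices   # the top-k ranking decides survival
    return tokens[keep_idx], keep_idx

# Toy illustration of ranking instability: a tiny score perturbation can swap
# which tokens fall inside the top-k, silently discarding task-critical ones.
torch.manual_seed(0)
tokens = torch.randn(8, 4)
scores = torch.tensor([0.90, 0.51, 0.50, 0.49, 0.10, 0.05, 0.03, 0.01])
_, kept_clean = compress_visual_tokens(tokens, scores, keep_ratio=0.25)
_, kept_pert  = compress_visual_tokens(tokens, scores + 0.03 * torch.randn(8), keep_ratio=0.25)
print(kept_clean.tolist(), kept_pert.tolist())  # the kept sets may differ under tiny noise
```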
Community
Visual token compression is widely used to accelerate inference in Large Vision–Language Models (LVLMs), enabling deployment in latency- and resource-constrained settings. This paper reveals that such compression introduces a previously overlooked security risk: models that are robust under full-token inference can become highly vulnerable once compression is enabled. We show that this vulnerability is compression-specific and stems from the instability of token-importance ranking, where small, imperceptible perturbations can cause task-critical visual tokens to be discarded. To systematically study this phenomenon, we propose Compression-Aware Attack (CAA), which explicitly targets the token selection mechanism and induces failures only under compressed inference. Extensive experiments across multiple LVLMs, datasets, and compression methods demonstrate a severe efficiency–security trade-off, highlighting the need for robustness-aware compression design in practical LVLM deployments.
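As rough intuition for how an attack could target the selection step itself, here is a hedged PyTorch sketch of a ranking-disruption surrogate objective optimized with a PGD-style loop. It is an assumed illustration rather than the paper's CAA formulation; `score_fn`, `ranking_disruption_loss`, and the hyperparameters are placeholders introduced here for exposition.

```python
# Hedged sketch of a compression-aware adversarial objective (an assumed
# surrogate, not the paper's CAA loss): perturb the image so that the
# importance scores of originally-kept tokens drop below the selection
# threshold, causing the compressor to discard task-critical tokens.
import torch

def ranking_disruption_loss(importance: torch.Tensor, kept_idx: torch.Tensor, k: int):
    """Encourage originally top-k tokens to fall below the k-th ranked score."""
    threshold = torch.sort(importance, descending=True).values[k - 1].detach()
    margin = importance[kept_idx] - threshold   # > 0 while a token is still kept
    return margin.clamp(min=0).sum()            # minimize to push kept tokens out

def compression_aware_attack(image, score_fn, kept_idx, k,
                             eps=8 / 255, steps=10, alpha=2 / 255):
    """PGD-style loop; `score_fn` maps an image to per-token importance scores."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = ranking_disruption_loss(score_fn(image + delta), kept_idx, k)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend: push kept tokens out of top-k
            delta.clamp_(-eps, eps)             # keep the perturbation imperceptible
            delta.grad.zero_()
    return (image + delta).detach()
```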
The following papers were recommended by the Semantic Scholar API:
- Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models (2025)
- SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models (2026)
- Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models (2025)
- ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration (2025)
- IPCV: Information-Preserving Compression for MLLM Visual Encoders (2025)
- FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models (2025)
- Efficient Vision-Language Reasoning via Adaptive Token Pruning (2025)