Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models
Cen Lu 1, Yung-Chen Tang 1, Andrea Cavallaro 1,2
Published on arXiv
2512.00918
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Masking only four neurons in LLaVA-1.5-7b-hf's language model FFN down-projection layer triggers catastrophic output collapse (repetitive sequences or empty outputs) across all inputs, revealing an extreme structural vulnerability concentrated in the language core of VLMs.
CAN (Consistently Activated Neurons)
Novel technique introduced
Large Vision-Language Models (LVLMs) have shown impressive multimodal understanding capabilities, yet their robustness is poorly understood. In this paper, we investigate the structural vulnerabilities of LVLMs to identify critical neurons whose removal triggers catastrophic collapse. To this end, we propose CAN, a method to detect Consistently Activated Neurons and to locate critical neurons by progressive masking. Experiments on LLaVA-1.5-7b-hf and InstructBLIP-Vicuna-7b reveal that masking only a tiny fraction of the language model's feed-forward networks (as few as four neurons in the extreme case) suffices to trigger catastrophic collapse. Notably, critical neurons are predominantly localized in the language model rather than in the vision components, and the down-projection layer is a particularly vulnerable structure. We also observe a consistent two-stage collapse pattern: initial expressive degradation followed by sudden, complete collapse. Our findings provide important insights for safety research in LVLMs.
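The abstract describes detecting Consistently Activated Neurons from activation statistics. The paper's exact criterion is not given here, so the sketch below assumes a simple definition: a neuron is "consistently activated" if its pre-down-projection activation is positive on at least some threshold fraction of inputs. The function name and the frequency threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def consistently_activated_neurons(activations: np.ndarray,
                                   threshold: float = 1.0) -> np.ndarray:
    """Return indices of neurons active (positive value) on at least
    `threshold` fraction of inputs.

    activations: (num_inputs, num_neurons) array of FFN intermediate
    activations collected over a probe set.  NOTE: this frequency-based
    criterion is an assumption for illustration.
    """
    active = activations > 0          # boolean activation mask per input
    freq = active.mean(axis=0)        # per-neuron activation frequency
    return np.flatnonzero(freq >= threshold)

# Toy example: neuron 1 fires on every input; neurons 0 and 2 do not.
acts = np.array([[0.2, 1.1, -0.3],
                 [-0.5, 0.9, 0.4],
                 [0.7, 2.0, -0.1]])
can_ids = consistently_activated_neurons(acts)   # -> array([1])
```

With `threshold=1.0` only neurons that fire on every probe input qualify; lowering the threshold would trade strictness for a larger candidate pool for the subsequent masking step.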
Key Contributions
- Proposes CAN (Consistently Activated Neurons), a method to detect consistently activated neurons and rank critical neurons in LVLMs via progressive masking across vision encoder, cross-modal alignment, and language model components.
- Demonstrates that masking as few as four neurons in the language model FFN of LLaVA-1.5-7b-hf triggers catastrophic collapse — an extreme structural fragility far beyond prior findings on domain-specific neuron ablation.
- Identifies that critical neurons are predominantly localized in the language model (not vision components) and reveals a consistent two-stage collapse pattern: expressive degradation followed by sudden complete failure.
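The ablation operation itself is simple to express: masking a neuron in the FFN down-projection amounts to zeroing the corresponding column of the down-projection weight, so that neuron's intermediate activation no longer contributes to the layer output. The sketch below uses a toy numpy weight matrix; the shapes, neuron indices, and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mask_ffn_neurons(w_down: np.ndarray, neuron_ids) -> np.ndarray:
    """Ablate intermediate-dimension neurons by zeroing their columns in the
    down-projection weight.

    w_down: (hidden_dim, intermediate_dim) FFN down-projection matrix.
    neuron_ids: indices of neurons to mask (hypothetical values below).
    """
    w = w_down.copy()
    w[:, neuron_ids] = 0.0            # masked neurons contribute nothing
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 32))          # toy sizes; LLaMA-7B uses 4096 x 11008
masked = mask_ffn_neurons(w, [3, 7, 11, 19])   # ablate four neurons

h = rng.normal(size=32)               # a toy intermediate activation vector
out = masked @ h                      # output now excludes neurons 3,7,11,19
```

In the paper's progressive-masking procedure, candidate neurons would be ablated in ranked order while monitoring output quality until collapse; the sketch above shows only the single ablation step that such a loop would repeat.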
🛡️ Threat Analysis
The paper demonstrates that direct weight/parameter manipulation — masking as few as four specific neurons in the language model FFN — causes catastrophic model failure. Per the guidelines, direct weight/parameter manipulation attacks map to ML10. While the attack does not embed hidden, trigger-conditioned behavior in the classical backdoor sense, it exploits structural fragility through targeted weight modification to disable model functionality entirely.