Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models
Cen Lu 1, Yung-Chen Tang 1, Andrea Cavallaro 1,2
Published on arXiv
2512.00918
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Masking only four neurons in LLaVA-1.5-7b-hf's language model FFN down-projection layer triggers catastrophic output collapse (repetitive sequences or empty outputs) across all inputs, revealing an extreme structural vulnerability concentrated in the language core of VLMs.
CAN (Consistently Activated Neurons)
Novel technique introduced
Large Vision-Language Models (LVLMs) have shown impressive multimodal understanding capabilities, yet their robustness is poorly understood. In this paper, we investigate the structural vulnerabilities of LVLMs to identify critical neurons whose removal triggers catastrophic collapse. To this end, we propose CAN, a method to detect Consistently Activated Neurons and to locate critical neurons by progressive masking. Experiments on LLaVA-1.5-7b-hf and InstructBLIP-Vicuna-7b reveal that masking only a tiny fraction of the language model's feed-forward networks (as few as four neurons in the extreme case) suffices to trigger catastrophic collapse. Notably, critical neurons are predominantly localized in the language model rather than in the vision components, and the down-projection layer is a particularly vulnerable structure. We also observe a consistent two-stage collapse pattern: initial expressive degradation followed by sudden, complete collapse. Our findings provide important insights for safety research in LVLMs.
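The abstract describes detecting Consistently Activated Neurons from activation statistics. The paper's exact criterion is not given here, so the sketch below assumes a simple definition: a neuron is "consistently activated" if its pre-down-projection activation is positive on at least some threshold fraction of inputs. The function name and the frequency threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def consistently_activated_neurons(activations: np.ndarray,
                                   threshold: float = 1.0) -> np.ndarray:
    """Return indices of neurons active (positive value) on at least
    `threshold` fraction of inputs.

    activations: (num_inputs, num_neurons) array of FFN intermediate
    activations collected over a probe set.  NOTE: this frequency-based
    criterion is an assumption for illustration.
    """
    active = activations > 0          # boolean activation mask per input
    freq = active.mean(axis=0)        # per-neuron activation frequency
    return np.flatnonzero(freq >= threshold)

# Toy example: neuron 1 fires on every input; neurons 0 and 2 do not.
acts = np.array([[0.2, 1.1, -0.3],
                 [-0.5, 0.9, 0.4],
                 [0.7, 2.0, -0.1]])
can_ids = consistently_activated_neurons(acts)   # -> array([1])
```

With `threshold=1.0` only neurons that fire on every probe input qualify; lowering the threshold would trade strictness for a larger candidate pool for the subsequent masking step.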
Key Contributions
- Proposes CAN (Consistently Activated Neurons), a method to detect consistently activated neurons and rank critical neurons in LVLMs via progressive masking across vision encoder, cross-modal alignment, and language model components.
- Demonstrates that masking as few as four neurons in the language model FFN of LLaVA-1.5-7b-hf triggers catastrophic collapse — an extreme structural fragility far beyond prior findings on domain-specific neuron ablation.
- Identifies that critical neurons are predominantly localized in the language model (not vision components) and reveals a consistent two-stage collapse pattern: expressive degradation followed by sudden complete failure.
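The ablation operation itself is simple to express: masking a neuron in the FFN down-projection amounts to zeroing the corresponding column of the down-projection weight, so that neuron's intermediate activation no longer contributes to the layer output. The sketch below uses a toy numpy weight matrix; the shapes, neuron indices, and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mask_ffn_neurons(w_down: np.ndarray, neuron_ids) -> np.ndarray:
    """Ablate intermediate-dimension neurons by zeroing their columns in the
    down-projection weight.

    w_down: (hidden_dim, intermediate_dim) FFN down-projection matrix.
    neuron_ids: indices of neurons to mask (hypothetical values below).
    """
    w = w_down.copy()
    w[:, neuron_ids] = 0.0            # masked neurons contribute nothing
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 32))          # toy sizes; LLaMA-7B uses 4096 x 11008
masked = mask_ffn_neurons(w, [3, 7, 11, 19])   # ablate four neurons

h = rng.normal(size=32)               # a toy intermediate activation vector
out = masked @ h                      # output now excludes neurons 3,7,11,19
```

In the paper's progressive-masking procedure, candidate neurons would be ablated in ranked order while monitoring output quality until collapse; the sketch above shows only the single ablation step that such a loop would repeat.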
🛡️ Threat Analysis
The paper demonstrates that direct weight/parameter manipulation — masking as few as four specific neurons in the language model FFN — causes catastrophic model failure. Per the guidelines, direct weight/parameter manipulation attacks map to ML10. While the attack does not embed hidden, trigger-conditioned behavior in the classical backdoor sense, it exploits structural fragility through targeted weight modification to disable model functionality entirely.