ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
Kun Wang 1, Cheng Qian 2, Miao Yu 3, Lilan Peng 4, Liang Lin 1, Jiaming Zhang 3, Tianyu Zhang 5, Yu Cheng 3, Yang Wang 1
1 Nanyang Technological University
2 University of Science and Technology of China
3 arXiv
Published on arXiv: 2604.19083
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Backdoor-critical parameters in MLLM projectors are encoded in low-rank subspaces, and the activation magnitude scales linearly with input norm, which makes it possible to distinguish poisoned samples from clean ones.
Novel technique introduced: ProjLens
Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated that backdoors can be implanted in MLLMs by poisoning fine-tuning data to manipulate inference, the underlying mechanisms of these attacks remain opaque, complicating both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, and that the resulting activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the magnitude of the shift scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
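The low-rank claim can be made concrete with a toy example. The sketch below is not the ProjLens code: it builds a synthetic projector update (the names `delta_w`, `trigger`, and `target`, the matrix sizes, and the noise level are all assumptions for illustration) as a rank-1 "backdoor" component plus dense noise, so the matrix is numerically full-rank. It then checks whether the top singular component alone still maps the trigger direction onto the target direction. Only the Python standard library is used, with power iteration standing in for a full SVD.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def unit(a):
    n = norm(a)
    return [x / n for x in a]

def matvec(A, x):
    return [dot(row, x) for row in A]

def top_singular_triplet(A, iters=100, seed=0):
    """Estimate (sigma1, u1, v1) of A by power iteration on A^T A."""
    rng = random.Random(seed)
    At = [list(col) for col in zip(*A)]
    v = unit([rng.gauss(0, 1) for _ in A[0]])
    for _ in range(iters):
        v = unit(matvec(At, matvec(A, v)))
    Av = matvec(A, v)
    sigma = norm(Av)
    return sigma, [x / sigma for x in Av], v

rng = random.Random(42)
d_out, d_in = 16, 12  # toy sizes, not real projector dimensions

# Hypothetical unit directions: the trigger in input space, the target in output space.
trigger = unit([rng.gauss(0, 1) for _ in range(d_in)])
target = unit([rng.gauss(0, 1) for _ in range(d_out)])

# Synthetic update: rank-1 backdoor component plus dense noise (full-rank overall).
delta_w = [
    [5.0 * target[i] * trigger[j] + rng.gauss(0, 0.2) for j in range(d_in)]
    for i in range(d_out)
]

sigma1, u1, v1 = top_singular_triplet(delta_w)

# Response of the rank-1 truncation sigma1 * u1 * v1^T to the trigger.
rank1_response = [sigma1 * dot(v1, trigger) * x for x in u1]
cosine_to_target = dot(rank1_response, target) / norm(rank1_response)

# Share of the squared Frobenius norm carried by the top singular component.
total_energy = sum(x * x for row in delta_w for x in row)
top1_energy_share = sigma1 ** 2 / total_energy

print(f"cosine(rank-1 response, target) = {cosine_to_target:.3f}")
print(f"top-1 energy share = {top1_energy_share:.3f}")
```

In this toy setting the top component does not account for all of the update's energy, yet it alone carries the trigger-to-target mapping. For real projector weights, a full decomposition such as `numpy.linalg.svd` would be the more practical tool.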
Key Contributions
- Interpretability framework (ProjLens) revealing that MLLM backdoors are encoded in low-rank subspaces of projector parameters, not dedicated trigger neurons
- Discovery that backdoor activation operates via norm-scaled semantic drift toward target embeddings, differing from text-only LLM backdoor mechanisms
- Experimental analysis across four backdoor variants showing projector fine-tuning alone introduces backdoor vulnerability in MLLMs
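The norm-scaled drift in the second contribution can be illustrated with a few lines of arithmetic. This is a toy model only, not the authors' implementation: every embedding is shifted along one shared unit direction by an amount proportional to its norm. The coefficient `ALPHA`, the fixed `ACTIVATION_THRESHOLD`, and the assumption that the poisoned input has the larger norm are all hypothetical choices made for this example.

```python
import math

ALPHA = 0.1                  # hypothetical drift coefficient
ACTIVATION_THRESHOLD = 1.0   # hypothetical fixed shift needed to trigger the backdoor

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def drift(x, direction, alpha=ALPHA):
    """Shift along a shared unit direction with magnitude alpha * ||x||."""
    scale = alpha * norm(x)
    return [scale * d for d in direction]

# Shared unit direction assumed to be aligned with the backdoor target.
target_direction = [0.0, 0.0, 1.0]

# Same orientation, different norms (assumption: the poisoned input is 4x larger).
clean = [3.0, 4.0, 0.0]       # norm 5
poisoned = [12.0, 16.0, 0.0]  # norm 20

shift_clean = norm(drift(clean, target_direction))
shift_poisoned = norm(drift(poisoned, target_direction))

print(f"clean shift:    {shift_clean:.2f}")
print(f"poisoned shift: {shift_poisoned:.2f}")
print(f"ratio:          {shift_poisoned / shift_clean:.2f}")
print("backdoor fires on clean:   ", shift_clean > ACTIVATION_THRESHOLD)
print("backdoor fires on poisoned:", shift_poisoned > ACTIVATION_THRESHOLD)
```

Both inputs drift in the same direction, so the direction alone does not separate them. In this sketch the separation comes entirely from the absolute shift, which grows in proportion to the input norm.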
🛡️ Threat Analysis
The paper analyzes backdoor attack mechanisms in MLLMs, specifically how backdoor triggers are encoded in projector parameters and how they activate during inference. Although it is interpretability-focused and does not propose new attacks, it studies the fundamental mechanisms of backdoor injection and activation in multimodal models.