ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior work has demonstrated that backdoors can be implanted in MLLMs by poisoning fine-tuning data to manipulate inference, the underlying mechanisms of these attacks remain opaque, which complicates both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, and that the activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the magnitude of the shift scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7

Key Contributions

Interpretability framework (ProjLens) revealing that MLLM backdoors are encoded in low-rank subspaces of projector parameters, not dedicated trigger neurons
Discovery that backdoor activation operates via norm-scaled semantic drift toward target embeddings, differing from text-only LLM backdoor mechanisms
Experimental analysis across four backdoor variants showing projector fine-tuning alone introduces backdoor vulnerability in MLLMs
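
The two findings listed above can be illustrated with a small synthetic sketch. This is not the ProjLens code: it assumes a single linear projector layer, and every name in it (`W_clean`, `W_tuned`, `low_rank_part`, `shift_norm`) as well as the planted rank-2 component is hypothetical. It shows how a weight update can be full-rank overall while most of its energy sits in a few singular directions, and why, for a linear map, the induced embedding shift grows in proportion to the input norm. Real projectors are often nonlinear MLPs, so exact proportionality holds here only because of the linear simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 32, 2

# Hypothetical clean projector: one linear layer (an assumption for simplicity).
W_clean = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)

# Synthetic fine-tuning update: a planted rank-2 component plus dense noise,
# so the total update is full-rank even though its key part is low-rank.
U = np.linalg.qr(rng.normal(size=(d_out, rank)))[0]  # orthonormal output directions
V = np.linalg.qr(rng.normal(size=(d_in, rank)))[0]   # orthonormal input directions
delta_backdoor = U @ np.diag([3.0, 2.0]) @ V.T
delta_noise = 0.01 * rng.normal(size=(d_out, d_in))
W_tuned = W_clean + delta_backdoor + delta_noise


def low_rank_part(delta, k):
    """Best rank-k approximation of delta via truncated SVD."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]


def shift_norm(W, x):
    """Norm of the embedding shift that W induces relative to W_clean."""
    return np.linalg.norm((W - W_clean) @ x)


delta = W_tuned - W_clean
delta_k = low_rank_part(delta, rank)

# Fraction of the update's squared Frobenius norm captured by the top-k directions.
energy = np.linalg.norm(delta_k) ** 2 / np.linalg.norm(delta) ** 2

# Subtracting the low-rank part leaves only the small dense residual.
W_repaired = W_tuned - delta_k

# Same input direction at two norms: the shift scales with the norm.
x = V[:, 0]
shift_small = shift_norm(W_tuned, 1.0 * x)
shift_large = shift_norm(W_tuned, 5.0 * x)
shift_repaired = shift_norm(W_repaired, 1.0 * x)

print("rank of full update:", np.linalg.matrix_rank(delta))
print("energy in top-k directions:", round(float(energy), 3))
print("shift ratio (norm 5 vs norm 1):", shift_large / shift_small)
print("shift before / after removing low-rank part:", shift_small, shift_repaired)
```

In this toy setting, removing the top singular directions of the update nearly eliminates the shift, which mirrors the idea that the backdoor-critical parameters occupy a low-rank subspace. Whether the same truncation works as a mitigation on a real projector is not something this sketch can establish.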

🛡️ Threat Analysis

Model Poisoning

The paper analyzes backdoor attack mechanisms in MLLMs, specifically how backdoor behavior is encoded in projector parameters and how it activates during inference. Although the paper focuses on interpretability rather than proposing new attacks, it studies the fundamental mechanisms of backdoor injection and activation in multimodal models.