Published on arXiv

2604.19083

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Backdoor-critical parameters in MLLM projectors are encoded in low-rank subspaces with activation magnitude scaling linearly with input norm, enabling distinction between clean and poisoned samples

ProjLens

Novel technique introduced


Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. Prior work has shown that backdoors can be implanted in MLLMs by poisoning fine-tuning data to manipulate inference, but the underlying mechanisms of these attacks remain opaque, which complicates both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, and that the resulting activation mechanism differs from the one observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: Backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: Both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the magnitude of the shift scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
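
The norm-scaling claim in finding (2) can be illustrated with a toy linear model. The sketch below is not the ProjLens code and makes several assumptions: the projector update is treated as a single linear map `delta_w` (real projectors may contain nonlinearities), the update is a hypothetical rank-1 matrix, and all names (`shift_magnitude`, `target_dir`, `probe`, `x_small`, `x_large`) are illustrative. Under those assumptions, two inputs that differ only in norm are shifted in the same direction, and the size of the shift grows in proportion to the input norm.

```python
import numpy as np

def shift_magnitude(delta_w, x):
    """Norm of the embedding shift that a linear update delta_w induces on input x."""
    return float(np.linalg.norm(delta_w @ x))

rng = np.random.default_rng(0)
d_in, d_out = 16, 8

# Hypothetical rank-1 update that pushes every input toward one shared direction.
target_dir = rng.normal(size=d_out)
target_dir /= np.linalg.norm(target_dir)
probe = rng.normal(size=d_in)
delta_w = np.outer(target_dir, probe)

# Two inputs with the same direction but different norms (illustrative only).
x_small = rng.normal(size=d_in)
x_large = 5.0 * x_small

shift_small = shift_magnitude(delta_w, x_small)
shift_large = shift_magnitude(delta_w, x_large)

# Both shifts lie along +/- target_dir; only their size differs.
cos = abs((delta_w @ x_small) @ target_dir) / shift_small
print(f"shift ratio: {shift_large / shift_small:.2f}, |cosine| with target: {cos:.2f}")
```

In this toy setting the shared direction comes from the rank-1 construction and the linear scaling comes from the linearity of `delta_w`. Whether and how this carries over to a trained projector is what the paper investigates empirically.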


Key Contributions

  • Interpretability framework (ProjLens) revealing that MLLM backdoors are encoded in low-rank subspaces of projector parameters, not dedicated trigger neurons
  • Discovery that backdoor activation operates via norm-scaled semantic drift toward target embeddings, differing from text-only LLM backdoor mechanisms
  • Experimental analysis across four backdoor variants showing projector fine-tuning alone introduces backdoor vulnerability in MLLMs
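
The contrast between an update that looks full-rank overall and a low-rank subspace that carries most of its weight can be sketched with a singular value decomposition. This is a minimal illustration on synthetic data, not the authors' procedure: the matrix `delta_w` stands in for a projector weight update, the rank `k = 2` and the noise scale are arbitrary assumptions, and the helpers `low_rank_part` and `energy_fraction` are hypothetical names.

```python
import numpy as np

def low_rank_part(delta_w, k):
    """Best rank-k approximation of delta_w (truncated SVD)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def energy_fraction(delta_w, k):
    """Share of the squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
d_out, d_in, k = 32, 64, 2

# Synthetic update: a strong rank-2 component plus small dense noise.
# The noise makes the matrix numerically full-rank, as in finding (1).
signal = rng.normal(size=(d_out, k)) @ rng.normal(size=(k, d_in))
noise = 0.01 * rng.normal(size=(d_out, d_in))
delta_w = signal + noise

approx = low_rank_part(delta_w, k)
full_rank = np.linalg.matrix_rank(delta_w)
approx_rank = np.linalg.matrix_rank(approx)
frac = energy_fraction(delta_w, k)
print(f"rank(delta_w)={full_rank}, rank(approx)={approx_rank}, top-{k} energy={frac:.4f}")
```

A check of this kind only shows that a low-rank component dominates the update. Establishing that the component is backdoor-critical would additionally require measuring attack success after keeping or removing it, which is outside the scope of this sketch.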

🛡️ Threat Analysis

Model Poisoning

The paper analyzes backdoor attack mechanisms in MLLMs, specifically how backdoor triggers are encoded in projector parameters and how they activate during inference. Although it is interpretability-focused and does not propose new attacks, it studies the fundamental mechanisms of backdoor injection and activation in multimodal models.


Details

Domains
multimodal, nlp, vision
Model Types
vlm, multimodal, transformer
Threat Tags
training_time, targeted
Applications
multimodal understanding, vision-language models