Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security
Wei Zhao, Zhe Li, Yige Li, Jun Sun
Published on arXiv (arXiv:2511.16229)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Q-MLLM achieves a 100% defense success rate against adversarial visual jailbreak attacks on MLLMs (with one arguable exception) while maintaining competitive performance on multimodal utility benchmarks with minimal inference overhead.
Q-MLLM
Novel technique introduced
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in cross-modal understanding, but remain vulnerable to adversarial attacks through visual inputs despite robust textual safety mechanisms. These vulnerabilities arise from two core weaknesses: the continuous nature of visual representations, which allows for gradient-based attacks, and the inadequate transfer of text-based safety mechanisms to visual content. We introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks while preserving multimodal reasoning capabilities. By discretizing visual representations at both pixel-patch and semantic levels, Q-MLLM blocks attack pathways and bridges the cross-modal safety alignment gap. Our two-stage training methodology ensures robust learning while maintaining model utility. Experiments demonstrate that Q-MLLM achieves significantly better defense success rate against both jailbreak attacks and toxic image attacks than existing approaches. Notably, Q-MLLM achieves perfect defense success rate (100%) against jailbreak attacks except in one arguable case, while maintaining competitive performance on multiple utility benchmarks with minimal inference overhead. This work establishes vector quantization as an effective defense mechanism for secure multimodal AI systems without requiring expensive safety-specific fine-tuning or detection overhead. Code is available at https://github.com/Amadeuszhao/QMLLM.
Key Contributions
- Two-level vector quantization architecture (pixel-patch and semantic levels) creating a discrete bottleneck that blocks gradient-based adversarial attack pathways on visual inputs to MLLMs
- Two-stage training methodology that preserves multimodal reasoning utility while enforcing adversarial robustness through discretization
- Empirical demonstration of a 100% defense success rate against jailbreak attacks (with one arguable exception), with minimal inference overhead and no need for expensive safety-specific fine-tuning
🛡️ Threat Analysis
The paper defends against gradient-based adversarial perturbations on visual inputs to MLLMs — the continuous nature of visual representations enabling gradient-based attacks is explicitly identified as the core vulnerability being addressed. The defense (vector quantization bottleneck) specifically blocks adversarial perturbation pathways at inference time.
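The gradient-blocking effect of a discrete bottleneck can be illustrated with a minimal nearest-neighbor vector-quantization sketch. This is a toy illustration, not the paper's implementation: the codebook size, feature dimension, and downstream loss are all invented for the example, whereas Q-MLLM quantizes at both pixel-patch and semantic levels inside the MLLM vision pathway.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: K discrete code vectors of dimension D.
K, D = 512, 16
codebook = rng.normal(size=(K, D))

def quantize(features):
    """Snap each continuous feature vector (N, D) to its nearest code.

    The output lives on a finite set of K vectors, so an infinitesimal
    input perturbation either leaves the output unchanged or causes a
    discrete jump -- there is no smooth path for an attacker to follow.
    """
    # Pairwise squared distances between features (N, D) and codes (K, D).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[d2.argmin(axis=1)]

def downstream_loss(x):
    """Stand-in for whatever objective an attacker would try to ascend."""
    return float((quantize(x) ** 2).sum())

x = rng.normal(size=(1, D))

# 1. A tiny adversarial-style perturbation lands on the same code...
x_adv = x + 1e-6 * rng.normal(size=(1, D))
print(np.array_equal(quantize(x), quantize(x_adv)))

# 2. ...so the finite-difference gradient of the loss is exactly zero
# almost everywhere: gradient-based attacks receive no useful signal.
print(downstream_loss(x_adv) - downstream_loss(x))
```

Because the loss is piecewise constant in the input, both a numerical and an autodiff gradient through the hard `argmin` are zero almost everywhere, which is the attack pathway the paper's quantization bottleneck closes off.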