MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models
Yuqi Li 1,2, Junhao Dong 3, Chuanguang Yang 1, Shiping Wen 4,5,6, Piotr Koniusz, Tingwen Huang, Yingli Tian 4, Yew-Soon Ong 1
1 Nanyang Technological University
2 Institute of Computing Technology, Chinese Academy of Sciences
3 University of Technology Sydney
Published on arXiv: 2511.17448
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on ViT-B-32 while achieving 2.3x training efficiency over single-teacher adversarial distillation methods.
MMT-ARD
Novel technique introduced
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our code is available at https://github.com/itsnotacie/MMT-ARD.
Key Contributions
- Dual-teacher knowledge fusion architecture that jointly optimizes clean feature preservation and robust feature enhancement in VLMs
- Dynamic weight allocation strategy based on teacher confidence to adaptively focus training on harder adversarial examples
- Adaptive sigmoid-based weighting function that mitigates teacher bias and balances cross-modal knowledge transfer strength
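The contributions above can be sketched as a single distillation loss. This is a minimal, hypothetical NumPy illustration, not the paper's implementation: the function names, the use of max-softmax probability as "teacher confidence", and the confidence-gap input to the sigmoid gate are all assumptions made for the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dual_teacher_loss(student_logits, clean_teacher_logits,
                      robust_teacher_logits, tau=2.0, k=5.0):
    """Illustrative sketch (assumed details, not the paper's exact loss):
    fuse soft targets from a clean teacher and a robust teacher, weight
    each sample by the robust teacher's (lack of) confidence so harder
    adversarial examples count more, and gate the two teachers with a
    sigmoid over their confidence gap."""
    p_s = softmax(student_logits / tau)
    p_clean = softmax(clean_teacher_logits / tau)
    p_rob = softmax(robust_teacher_logits / tau)

    # Dynamic per-sample weight: low robust-teacher confidence -> harder
    # sample -> larger weight; normalized to mean 1 over the batch.
    conf = p_rob.max(axis=-1)
    sample_w = (1.0 - conf) + 1e-6
    sample_w = sample_w / sample_w.mean()

    # Sigmoid gate balancing the two teachers' knowledge-transfer strength.
    gap = conf - p_clean.max(axis=-1)
    alpha = 1.0 / (1.0 + np.exp(-k * gap))          # in (0, 1)
    target = alpha[:, None] * p_rob + (1.0 - alpha[:, None]) * p_clean

    # Sample-weighted KL divergence from the fused target to the student.
    kl = (target * (np.log(target + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float((sample_w * kl).mean())
```

When the student already matches both teachers, the fused target coincides with the student distribution and the loss vanishes, which is the sanity check one would expect of a distillation objective.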
🛡️ Threat Analysis
Directly defends against adversarial input-manipulation attacks on VLMs: the framework improves adversarial robustness (resistance to gradient-based perturbations at inference time) by distilling knowledge from multiple robust teacher models into the student.
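For concreteness, the "gradient-based perturbations" this threat model refers to are typically crafted with attacks like PGD. The sketch below runs PGD against a tiny logistic classifier where the input gradient has a closed form; it is only an illustration of the attack class, not the VLM attacks evaluated in the paper.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L_inf PGD on a binary logistic classifier (illustrative toy model).
    Each step ascends the cross-entropy loss via the sign of the input
    gradient, then projects back into the eps-ball around the clean input."""
    x_adv = x.copy()
    for _ in range(steps):
        logits = x_adv @ w + b
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = (p - y)[:, None] * w[None, :]       # d(BCE)/dx in closed form
        x_adv = x_adv + alpha * np.sign(grad)      # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project to L_inf ball
    return x_adv
```

A robustly distilled student is one whose loss (and accuracy) degrades as little as possible under exactly this kind of bounded perturbation.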