defense arXiv Nov 22, 2025
Junrui Zhang, Xinyu Zhao, Jie Peng et al. · University of Science & Technology of China · University of North Carolina at Chapel Hill +1 more
Adversarial training defense that quantifies per-modality vulnerability to selectively harden multimodal models against adversarial attacks
Input Manipulation Attack multimodal
Multimodal learning has shown significant advantages on various tasks by integrating multiple modalities. However, the interdependencies among modalities increase the susceptibility of multimodal models to adversarial attacks. Existing methods mainly focus on attacking specific modalities or indiscriminately attacking all modalities. In this paper, we find that these approaches ignore the differing contributions of modalities to final robustness, resulting in suboptimal robustness. To bridge this gap, we introduce Vulnerability-Aware Robust Multimodal Adversarial Training (VARMAT), a probe-in-training adversarial training method that improves multimodal robustness by identifying the vulnerability of each modality. Specifically, VARMAT first explicitly quantifies each modality's vulnerability, grounded in a first-order approximation of the attack objective (Probe). It then applies a targeted regularization term that penalizes modalities with high vulnerability, guiding robust learning while maintaining task accuracy (Training). We demonstrate the enhanced robustness of our method across multiple multimodal datasets involving diverse modalities, achieving robustness improvements of {12.73%, 22.21%, 11.19%} on three multimodal datasets and revealing a significant blind spot in multimodal adversarial training.
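The abstract's "Probe" step can be illustrated with a minimal sketch: under a first-order approximation, the gradient norm of the loss with respect to each modality's input bounds how much a small perturbation to that modality can raise the loss, so it serves as a vulnerability score. The function names, the finite-difference gradient (a stand-in for autograd), and the toy loss below are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vulnerability_scores(loss_fn, inputs, eps=1e-4):
    """Hypothetical probe: score each modality by the norm of the loss
    gradient w.r.t. that modality's input (first-order approximation of
    the attack objective), estimated here by central finite differences."""
    scores = {}
    for m, x in inputs.items():
        grad = np.zeros_like(x)
        for i in range(x.size):
            d = np.zeros_like(x)
            d.flat[i] = eps
            grad.flat[i] = (loss_fn({**inputs, m: x + d})
                            - loss_fn({**inputs, m: x - d})) / (2 * eps)
        scores[m] = np.linalg.norm(grad)
    total = sum(scores.values())
    # Normalize so scores sum to 1; the high-score modality would receive
    # the stronger regularization penalty in the "Training" step.
    return {m: s / total for m, s in scores.items()}

# Toy loss where the "image" modality dominates: its per-element gradient
# is 10x that of "text", so it should be flagged as more vulnerable.
loss = lambda z: 10.0 * z["image"].sum() + 1.0 * z["text"].sum()
scores = vulnerability_scores(loss, {"image": np.zeros(3), "text": np.zeros(3)})
```

In a full training loop, these per-modality scores would weight adversarial regularization terms so the most vulnerable modality is hardened hardest.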
multimodal transformer University of Science & Technology of China · University of North Carolina at Chapel Hill · Hefei Comprehensive National Science Center
defense arXiv Oct 19, 2025
Pingzhi Li, Morris Yu-Chao Huang, Zhen Tan et al. · UNC-Chapel Hill · Arizona State University +4 more
Detects LLM knowledge distillation (model theft) by fingerprinting MoE expert routing patterns in both white-box and black-box settings
Model Theft nlp
Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses risks to intellectual property protection and LLM diversity. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the transfer of structural habits in LLMs.
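The fingerprinting idea can be sketched minimally: summarize a model's routing behavior over a fixed probe set as a distribution over experts, then compare two models' distributions with a divergence measure. Everything below is an illustrative assumption, not the paper's method; Jensen-Shannon divergence stands in for whatever similarity metric the authors actually use, and the counts are toy data.

```python
import numpy as np

def routing_fingerprint(routing_counts):
    """Hypothetical fingerprint: normalize per-probe expert-selection
    counts into a distribution over experts."""
    counts = np.asarray(routing_counts, dtype=float)
    return counts / counts.sum()

def fingerprint_similarity(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two routing fingerprints
    (lower = more similar routing habits; 0 for identical models)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy expert-selection counts over a shared probe set (4 experts).
teacher   = routing_fingerprint([40, 30, 20, 10])
student   = routing_fingerprint([38, 31, 21, 10])   # distilled: similar habits
unrelated = routing_fingerprint([10, 10, 10, 70])   # independently trained
```

A distilled student inherits the teacher's routing habits, so its fingerprint divergence from the teacher should be much smaller than an unrelated model's; in the black-box Shadow-MoE setting, these fingerprints would come from proxy MoE models built via auxiliary distillation rather than direct router access.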
llm transformer UNC-Chapel Hill · Arizona State University · Individual Contributor +3 more