Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu, Jiahuan Long, Yiwei Wei, Tingsong Jiang, Wen Yao
Published on arXiv
arXiv:2604.03117
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Single universal patch consistently degrades semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, and physical-world effectiveness
UCGP (Universal Curved-Grid Patch)
Novel technique introduced
Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.
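The abstract's "subspace departure" objective can be illustrated with a toy numpy sketch: fit a low-dimensional subspace to clean visual features, then score how much of a (patched) feature's energy remains inside it; an attacker minimizes that score. This is only an illustration of the idea, not the paper's implementation, and the function names (`clean_subspace_basis`, `subspace_departure_loss`) are hypothetical:

```python
import numpy as np

def clean_subspace_basis(clean_feats: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (top-k right singular vectors) spanning the
    dominant subspace of clean features. `clean_feats`: (n_samples, dim)."""
    # Center, then SVD; rows of Vt are principal directions.
    X = clean_feats - clean_feats.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]                      # shape (k, dim)

def subspace_departure_loss(feat: np.ndarray, basis: np.ndarray) -> float:
    """Fraction of `feat`'s energy lying inside the clean subspace.
    Driving this toward 0 pushes the patched image's representation
    out of the clean-feature subspace ("subspace departure")."""
    proj = basis.T @ (basis @ feat)    # projection onto the subspace
    return float(proj @ proj / (feat @ feat + 1e-12))
```

A feature aligned with the clean subspace scores near 1; one orthogonal to it scores near 0, which is the direction the patch optimization would favor.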
Key Contributions
- First universal physical adversarial patch framework (UCGP) specifically designed for infrared vision-language models
- Curved-Grid Mesh parameterization for continuous, low-frequency, physically deployable patches with representation-driven objectives
- Demonstrated cross-model transferability, cross-dataset generalization, and real-world physical effectiveness against IR-VLMs across classification, captioning, and VQA tasks
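The Expectation-over-Transformation (EOT) component referenced above can be sketched as a loop that averages the attack loss over sampled deployment conditions. The sketch below is a minimal stand-in: where the paper models TPS (thin-plate spline) deformations, we substitute simple brightness/contrast jitter, and both function names are hypothetical:

```python
import numpy as np

def sample_transform(rng: np.random.Generator):
    """Draw one simulated deployment condition. A faithful pipeline
    would sample TPS warps and sensor effects here; brightness and
    contrast jitter are used as a stand-in (assumption, not the
    paper's exact deformation model)."""
    gain = rng.uniform(0.8, 1.2)       # contrast variation
    bias = rng.uniform(-0.05, 0.05)    # brightness variation
    return lambda patch: np.clip(gain * patch + bias, 0.0, 1.0)

def eot_loss(patch: np.ndarray, loss_fn, rng: np.random.Generator,
             n_samples: int = 16) -> float:
    """Expectation over Transformations: average the attack loss over
    sampled conditions so the optimized patch remains effective when
    physically deployed."""
    return float(np.mean([loss_fn(sample_transform(rng)(patch))
                          for _ in range(n_samples)]))
```

In an outer optimization loop (gradient-based or evolutionary, as with the paper's Meta Differential Evolution), `eot_loss` would replace the single-view loss so every candidate patch is scored under many simulated physical conditions.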
🛡️ Threat Analysis
Proposes adversarial patches that manipulate infrared VLM behavior at inference time by disrupting visual representations and cross-modal alignment. The attack crafts deployable physical perturbations that cause misclassification and semantic deviations in open-ended outputs (captioning, VQA). This is a clear input manipulation attack on multimodal models.