defense 2025

Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks

Het Patel, Muzammil Allie, Qian Zhang, Jia Chen, Evangelos E. Papalexakis



Published on arXiv: 2509.16163

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

On Flickr30K, the defense recovers 12.3 percentage points of Recall@1 lost to adversarial attacks (7.5% → 19.8%); on COCO, it recovers 8.1 points (3.8% → 11.9%).

Tensor Train decomposition defense

Novel technique introduced


Vision-language models (VLMs) excel at multimodal understanding but are vulnerable to adversarial attacks. Existing defenses often demand costly retraining or significant architectural changes. We introduce a lightweight defense based on tensor decomposition that is suitable for any pre-trained VLM and requires no retraining. By decomposing and reconstructing the vision encoder's representations, it filters adversarial noise while preserving semantic meaning. Experiments with CLIP on COCO and Flickr30K show improved robustness. On Flickr30K, the defense restores 12.3 percentage points of Recall@1 performance lost to attacks, raising accuracy from 7.5% to 19.8%. On COCO, it recovers 8.1 points, improving accuracy from 3.8% to 11.9%. Our analysis shows that Tensor Train decomposition with low rank (8-32) and low residual strength (α = 0.1-0.2) is optimal. The method is a practical, plug-and-play solution that adds minimal overhead to existing VLMs.
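The decompose-and-reconstruct step described above can be sketched in NumPy using a simple TT-SVD. This is a minimal illustration, not the authors' implementation: the function names, and the particular residual blend `(1 − α)·recon + α·original` interpreting "residual strength α", are assumptions for the sketch.

```python
import numpy as np

def tt_decompose(x, max_rank):
    """TT-SVD: factor x into tensor-train cores, truncating each SVD at max_rank."""
    shape = x.shape
    cores, r_prev = [], 1
    mat = x.reshape(shape[0], -1)
    for k in range(len(shape) - 1):
        mat = mat.reshape(r_prev * shape[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, s.size)                     # cap the TT-rank (paper: 8-32)
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        mat = s[:r, None] * vt[:r]                    # carry the remainder to the next mode
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a dense tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape([c.shape[1] for c in cores])

def tt_defense(activations, rank=16, alpha=0.15):
    """Filter an activation tensor: low-rank TT reconstruction blended with a
    small residual of the original signal (blend formula is an assumption)."""
    recon = tt_reconstruct(tt_decompose(activations, rank))
    return (1.0 - alpha) * recon + alpha * activations

# Example: filter a (batch, tokens, dim)-shaped vision-encoder activation
acts = np.random.rand(4, 5, 6)
defended = tt_defense(acts, rank=16, alpha=0.15)
```

The low-rank truncation is what discards high-frequency adversarial noise; the small α-weighted residual keeps some of the original signal so clean-input accuracy degrades less.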


Key Contributions

  • Lightweight, plug-and-play tensor decomposition defense applicable to any pre-trained VLM without retraining or architectural modification
  • Comprehensive analysis of decomposition parameters (rank, residual strength, target layers), finding that Tensor Train decomposition with rank 8-32 and α = 0.1-0.2 is optimal
  • Empirical robustness improvements on COCO and Flickr30K, recovering 8.1 and 12.3 percentage points of Recall@1 lost to adversarial attacks, respectively

🛡️ Threat Analysis

Input Manipulation Attack

Directly defends against adversarial input perturbations on VLMs (specifically CLIP) by filtering adversarial noise from the vision encoder's intermediate representations via low-rank tensor decomposition, a classic inference-time defense against input manipulation attacks.


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
inference_time, digital
Datasets
COCO, Flickr30K
Applications
image-text retrieval, multimodal understanding, vision-language models