defense 2025

Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks

Het Patel, Muzammil Allie, Qian Zhang, Jia Chen, Evangelos E. Papalexakis



Published on arXiv: 2509.16163

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

On Flickr30K, the defense recovers 12.3 percentage points of Recall@1 lost to adversarial attacks (7.5% → 19.8%); on COCO, it recovers 8.1 points (3.8% → 11.9%).

Tensor Train decomposition defense

Novel technique introduced


Vision-language models (VLMs) excel at multimodal understanding but are vulnerable to adversarial attacks. Existing defenses often demand costly retraining or significant architectural changes. We introduce a lightweight defense based on tensor decomposition that is suitable for any pre-trained VLM and requires no retraining. By decomposing and reconstructing the vision encoder's representations, it filters adversarial noise while preserving semantic meaning. Experiments with CLIP on COCO and Flickr30K show improved robustness. On Flickr30K, the defense restores 12.3 percentage points of Recall@1 performance lost to attacks, raising accuracy from 7.5% to 19.8%. On COCO, it recovers 8.1 points, improving accuracy from 3.8% to 11.9%. Our analysis shows that Tensor Train decomposition with low rank (8-32) and low residual strength (α = 0.1-0.2) is optimal. The method is a practical, plug-and-play solution that adds minimal overhead to existing VLMs.
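The decompose-and-reconstruct step described above can be sketched in NumPy using a simple TT-SVD. This is a minimal illustration, not the authors' implementation: the function names, and the particular residual blend `(1 − α)·recon + α·original` interpreting "residual strength α", are assumptions for the sketch.

```python
import numpy as np

def tt_decompose(x, max_rank):
    """TT-SVD: factor x into tensor-train cores, truncating each SVD at max_rank."""
    shape = x.shape
    cores, r_prev = [], 1
    mat = x.reshape(shape[0], -1)
    for k in range(len(shape) - 1):
        mat = mat.reshape(r_prev * shape[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, s.size)                     # cap the TT-rank (paper: 8-32)
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        mat = s[:r, None] * vt[:r]                    # carry the remainder to the next mode
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a dense tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape([c.shape[1] for c in cores])

def tt_defense(activations, rank=16, alpha=0.15):
    """Filter an activation tensor: low-rank TT reconstruction blended with a
    small residual of the original signal (blend formula is an assumption)."""
    recon = tt_reconstruct(tt_decompose(activations, rank))
    return (1.0 - alpha) * recon + alpha * activations

# Example: filter a (batch, tokens, dim)-shaped vision-encoder activation
acts = np.random.rand(4, 5, 6)
defended = tt_defense(acts, rank=16, alpha=0.15)
```

The low-rank truncation is what discards high-frequency adversarial noise; the small α-weighted residual keeps some of the original signal so clean-input accuracy degrades less.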


Key Contributions

  • Lightweight, plug-and-play tensor decomposition defense applicable to any pre-trained VLM without retraining or architectural modification
  • Comprehensive analysis of decomposition parameters (rank, residual strength, target layers), finding that Tensor Train decomposition with rank 8-32 and α = 0.1-0.2 is optimal
  • Empirical robustness improvements on COCO and Flickr30K, recovering 8.1 and 12.3 percentage points of Recall@1 lost to adversarial attacks, respectively

🛡️ Threat Analysis

Input Manipulation Attack

Directly defends against adversarial input perturbations on VLMs (specifically CLIP) by filtering adversarial noise from the vision encoder's intermediate representations via low-rank tensor decomposition, a classic inference-time defense against input manipulation attacks.


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
inference_time, digital
Datasets
COCO, Flickr30K
Applications
image-text retrieval, multimodal understanding, vision-language models