Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization
Hui Lu 1, Yi Yu 1, Yiming Yang 1, Chenyu Yi 1, Xueyi Ke 1, Qixing Zhang 1, Bingquan Shen 2, Alex Kot 1, Xudong Jiang 1
Published on arXiv
2601.23179
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves a +23.7% improvement in unseen-image attack success rate on GPT-4o and +19.9% on Gemini-2.0 over the strongest universal adversarial baseline
MCRMO-Attack
Novel technique introduced
Targeted adversarial attacks on closed-source multimodal large language models (MLLMs) have been increasingly explored under black-box transfer, yet prior methods are predominantly sample-specific and offer limited reusability across inputs. We instead study a more stringent setting, Universal Targeted Transferable Adversarial Attacks (UTTAA), where a single perturbation must consistently steer arbitrary inputs toward a specified target across unknown commercial MLLMs. Naively adapting existing sample-wise attacks to this universal setting faces three core difficulties: (i) target supervision becomes high-variance due to target-crop randomness, (ii) token-wise matching is unreliable because universality suppresses image-specific cues that would otherwise anchor alignment, and (iii) few-source per-target adaptation is highly initialization-sensitive, which can degrade the attainable performance. In this work, we propose MCRMO-Attack, which stabilizes supervision via Multi-Crop Aggregation with an Attention-Guided Crop, improves token-level reliability through alignability-gated Token Routing, and meta-learns a cross-target perturbation prior that yields stronger per-target solutions. Across commercial MLLMs, we boost unseen-image attack success rate by +23.7% on GPT-4o and +19.9% on Gemini-2.0 over the strongest universal baseline.
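The multi-crop supervision idea can be sketched as follows: average a target-matching loss over several random crops plus one crop centred on an attention peak, so a single unlucky crop does not dominate the gradient. This is a minimal numpy sketch, not the paper's implementation; the `embed` callable, the `attn` map, and the crop sizes are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Sample a random square crop of side `size` from an HxWxC image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def attention_guided_crop(img, attn, size):
    """Crop centred on the peak of a (hypothetical) HxW attention map."""
    h, w, _ = img.shape
    cy, cx = np.unravel_index(np.argmax(attn), attn.shape)
    top = int(np.clip(cy - size // 2, 0, h - size))
    left = int(np.clip(cx - size // 2, 0, w - size))
    return img[top:top + size, left:left + size]

def multi_crop_loss(img, attn, target_feat, embed, size=64, n_random=4):
    """Average a cosine target-matching loss over several random crops
    (multi-crop aggregation), always including one attention-guided crop.
    `embed` stands in for a surrogate vision encoder."""
    crops = [random_crop(img, size) for _ in range(n_random)]
    crops.append(attention_guided_crop(img, attn, size))
    feats = np.stack([embed(c) for c in crops])           # (n+1, d)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    t = target_feat / np.linalg.norm(target_feat)
    return float(np.mean(1.0 - feats @ t))                # mean cosine distance
```

Averaging over crops lowers the variance of the supervision signal at the cost of extra encoder passes per optimization step.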
Key Contributions
- Defines the UTTAA (Universal Targeted Transferable Adversarial Attacks) setting: a single perturbation steers arbitrary unseen inputs toward a specified target on closed-source MLLMs
- Multi-Crop Aggregation with Attention-Guided Crop (MCA+AGC) to reduce target supervision variance from crop randomness
- Alignability-gated Token Routing for reliable token-level matching under universality
- Meta-learned cross-target perturbation prior that makes few-shot per-target adaptation robust to initialization
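The gating idea behind token routing can be illustrated with a short sketch: match each adversarial-image token to its most similar target token, then let only pairs whose similarity clears a threshold contribute to the loss, dropping tokens that cannot be reliably aligned. This is an illustrative reconstruction under stated assumptions; the threshold `tau` and the fallback value are placeholders, not the paper's settings.

```python
import numpy as np

def routed_token_loss(adv_tokens, tgt_tokens, tau=0.5):
    """Alignability-gated token matching (hypothetical sketch).

    adv_tokens: (Na, d) token features from the perturbed image.
    tgt_tokens: (Nt, d) token features from the target.
    Only adversarial tokens whose best cosine match exceeds `tau`
    contribute; unreliable tokens are routed out of the loss."""
    a = adv_tokens / np.linalg.norm(adv_tokens, axis=1, keepdims=True)
    t = tgt_tokens / np.linalg.norm(tgt_tokens, axis=1, keepdims=True)
    sim = a @ t.T                  # (Na, Nt) pairwise cosine similarities
    best = sim.max(axis=1)         # best target match per adversarial token
    keep = best >= tau             # the alignability gate
    if not keep.any():             # no token is reliably alignable
        return 1.0                 # placeholder fallback loss
    return float(np.mean(1.0 - best[keep]))
```

The gate matters in the universal setting because, as the abstract notes, universality suppresses the image-specific cues that would otherwise anchor token alignment.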
🛡️ Threat Analysis
Proposes gradient-based universal adversarial perturbations applied to input images that steer closed-source MLLMs toward attacker-specified targeted outputs at inference time, a direct input manipulation attack.
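The attack primitive itself, one shared perturbation optimized over a batch and reused on arbitrary images, follows the standard universal-PGD recipe. Below is a minimal numpy sketch under that assumption; `grad_fn`, the step size `alpha`, and the L∞ budget `eps` are illustrative stand-ins for the surrogate model's loss gradient and the paper's hyperparameters.

```python
import numpy as np

def apply_universal(images, delta, eps=8 / 255):
    """Add one shared perturbation `delta`, clipped to an L-infinity
    ball of radius `eps`, to every image, then clamp to [0, 1]."""
    d = np.clip(delta, -eps, eps)
    return np.clip(images + d, 0.0, 1.0)

def pgd_universal_step(images, delta, grad_fn, alpha=1 / 255, eps=8 / 255):
    """One signed-gradient step on the shared perturbation, averaging
    per-image gradients so the update helps the whole batch.
    `grad_fn` is a placeholder for the surrogate loss gradient."""
    grads = np.stack([grad_fn(apply_universal(x[None], delta)[0])
                      for x in images])
    delta = delta - alpha * np.sign(grads.mean(axis=0))   # descend the target loss
    return np.clip(delta, -eps, eps)                      # stay in the budget
```

Because `delta` is optimized once and then added to any input, a successful perturbation is reusable across unseen images, which is what makes the universal setting a sharper threat than sample-wise attacks.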