Defense · 2025

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su 1, Wang Lu 2, Hao Chen 3, Sharon Li 4, Jindong Wang 1

0 citations · 46 references · arXiv


Published on arXiv · 2511.19413

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

UniGame improves adversarial robustness by +6.2% on AdVQA and OOD robustness by +4.8% on NaturalBench, while also improving understanding (+3.6%) and generation quality, with less than 1% additional parameters.

UniGame / PerturbNet

Novel technique introduced


Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets this inconsistency. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek out and challenge fragile understanding, turning the model into its own adversary. Experiments demonstrate that UniGame significantly improves consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), and out-of-distribution and adversarial robustness (+4.8% on NaturalBench and +6.2% on AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame


Key Contributions

  • PerturbNet: a lightweight modulator that injects bounded, targeted perturbations into shared visual-token embeddings at dual injection points (understanding path and image decoder)
  • Closed-loop minimax self-play framework coupling generation and understanding branches to continually challenge fragile vision-language decision boundaries
  • Hardness-aware mining buffer combining answer difficulty and CLIP-based semantic plausibility for curriculum-driven robustification
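To make the first contribution concrete, here is a minimal sketch of what a "lightweight modulator that injects bounded perturbations" could look like. All names, layer sizes, and the tanh-based L-infinity bound are illustrative assumptions, not the paper's exact architecture; the point is only that a tiny residual MLP can emit a perturbation whose magnitude is capped before it is added to the shared visual-token embeddings.

```python
import numpy as np

class PerturbNet:
    """Hypothetical sketch of a lightweight perturber: a tiny two-layer MLP
    emitting a bounded residual for shared visual-token embeddings."""

    def __init__(self, dim=64, hidden=16, eps=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (dim, hidden))
        self.w2 = rng.normal(0.0, 0.02, (hidden, dim))
        self.eps = eps  # L-infinity cap on the injected perturbation

    def __call__(self, tokens):
        # tokens: (num_tokens, dim) shared visual-token embeddings
        h = np.tanh(tokens @ self.w1)
        delta = self.eps * np.tanh(h @ self.w2)  # entries lie in [-eps, eps]
        return tokens + delta, delta

tokens = np.random.default_rng(1).normal(size=(5, 64))
perturbed, delta = PerturbNet()(tokens)
```

Because the same perturbed tokens can be fed to both the understanding path and the image decoder, one module covers the dual injection points described above; its parameter count (dim × hidden × 2 here) stays far below 1% of a typical UMM.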

🛡️ Threat Analysis

Input Manipulation Attack

Proposes adversarial training as a defense: a lightweight PerturbNet generates bounded perturbations of shared visual tokens in a minimax game to harden the understanding branch against adversarial inputs. Evaluates adversarial robustness (+6.2% on AdVQA), which is a primary stated outcome.
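The minimax game can be sketched as an alternating update: the perturber ascends the understanding loss on perturbed tokens while the understanding branch descends it. The toy below uses stand-in linear models and a squared-error proxy loss, so every name and the loss itself are assumptions for illustration only; it shows the training dynamic, not UniGame's actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
tokens = rng.normal(size=(16, dim))   # shared visual-token embeddings
target = rng.normal(size=(16,))       # stand-in answer signal
w_model = np.zeros(dim)               # understanding head (defender)
w_pert = np.zeros((dim, dim))         # perturber (adversary)
eps, lr = 0.05, 0.1                   # perturbation bound, step size

def loss_and_grads(w_m, w_p):
    delta = eps * np.tanh(tokens @ w_p)   # bounded perturbation
    x = tokens + delta
    err = x @ w_m - target                # squared-error proxy loss
    g_m = x.T @ err / len(err)            # gradient w.r.t. defender
    # gradient w.r.t. perturber via the chain rule through tanh
    g_delta = np.outer(err, w_m) * eps * (1 - np.tanh(tokens @ w_p) ** 2)
    g_p = tokens.T @ g_delta / len(err)
    return float(np.mean(err ** 2)), g_m, g_p

for _ in range(200):
    _, g_m, g_p = loss_and_grads(w_model, w_pert)
    w_pert += lr * g_p    # adversary: gradient ascent on the loss
    w_model -= lr * g_m   # defender: gradient descent on the loss

final_loss, _, _ = loss_and_grads(w_model, w_pert)
```

The bound eps keeps the adversary from drifting into implausible inputs, which is why the defender can still converge: it is hardened against the worst bounded perturbation rather than arbitrary noise.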


Details

Domains
multimodal · vision · nlp
Model Types
vlm · multimodal
Threat Tags
white_box · training_time · digital · untargeted
Datasets
NaturalBench · AdVQA
Applications
visual question answering · multimodal understanding and generation