A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
Tianle Chen, Deepti Ghadiyaram
Published on arXiv
2604.03995
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Coordinated cross-modal typographic attacks achieve 83.43% attack success rate compared to 34.93% for single-modality attacks on frontier audio-visual MLLMs
Multi-Modal Typography
Novel technique introduced
As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that coordinated multi-modal attacks create a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs. $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
Key Contributions
- First systematic study of cross-modal typographic attacks on audio-visual MLLMs
- Demonstrates that coordinated multi-modal attacks achieve 83.43% success rate vs 34.93% for single-modality attacks
- Evaluates attacks across multiple frontier MLLMs on common-sense reasoning and content moderation tasks
🛡️ Threat Analysis
Typographic attacks are input manipulation attacks that craft adversarial text perturbations across audio and visual modalities to cause misclassification and bypass content moderation at inference time.
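The coordination described above can be sketched conceptually: the same misleading typographic payload is injected into more than one modality at once, so the model receives mutually reinforcing false cues. The sketch below is illustrative only and not the paper's actual pipeline; in practice the payload would be rendered as on-frame text and synthesized as speech, whereas here modalities are represented as plain strings and all names (`typographic_attack`, the sample fields) are hypothetical.

```python
# Conceptual sketch of a coordinated cross-modal typographic attack.
# Illustrative only: real attacks render adv_text into video frames and
# mix matching synthesized speech into the audio track.

def typographic_attack(sample: dict, adv_text: str,
                       modalities: tuple = ("visual", "audio")) -> dict:
    """Inject the same misleading text into the chosen modalities.

    sample: {"visual": ..., "audio": ..., "text": ...} (string stand-ins)
    adv_text: the typographic payload, e.g. a wrong class label.
    """
    attacked = dict(sample)
    if "visual" in modalities:
        # Visual channel: payload appears as overlaid on-frame text.
        attacked["visual"] = f'{sample["visual"]} [overlay text: "{adv_text}"]'
    if "audio" in modalities:
        # Audio channel: payload appears as spoken words in the track.
        attacked["audio"] = f'{sample["audio"]} [spoken: "{adv_text}"]'
    return attacked


benign = {
    "visual": "frame of a dog",
    "audio": "barking sounds",
    "text": "What animal is this?",
}

# Single-modality attack: only the visual channel carries the payload.
single = typographic_attack(benign, "this is a cat", modalities=("visual",))

# Coordinated attack: visual and audio channels agree on the false cue,
# which the paper finds far more potent (83.43% vs. 34.93% success rate).
coordinated = typographic_attack(benign, "this is a cat")
```

The key design point the sketch captures is that the coordinated variant presents a consistent false signal across channels, so the model cannot resolve the conflict by trusting one unperturbed modality.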