A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
Tianle Chen, Deepti Ghadiyaram
Published on arXiv
2604.03995
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Coordinated cross-modal typographic attacks achieve 83.43% attack success rate compared to 34.93% for single-modality attacks on frontier audio-visual MLLMs
Multi-Modal Typography
Novel technique introduced
As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that coordinated multi-modal attacks create a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs. $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
Key Contributions
- First systematic study of cross-modal typographic attacks on audio-visual MLLMs
- Demonstrates that coordinated multi-modal attacks achieve 83.43% success rate vs 34.93% for single-modality attacks
- Evaluates attacks across multiple frontier MLLMs on common-sense reasoning and content moderation tasks
🛡️ Threat Analysis
Typographic attacks are input manipulation attacks that craft adversarial text perturbations across audio and visual modalities to cause misclassification and bypass content moderation at inference time.
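The coordination described above can be sketched conceptually: the same misleading typographic payload is injected into more than one modality at once, so the model receives mutually reinforcing false cues. The sketch below is illustrative only and not the paper's actual pipeline; in practice the payload would be rendered as on-frame text and synthesized as speech, whereas here modalities are represented as plain strings and all names (`typographic_attack`, the sample fields) are hypothetical.

```python
# Conceptual sketch of a coordinated cross-modal typographic attack.
# Illustrative only: real attacks render adv_text into video frames and
# mix matching synthesized speech into the audio track.

def typographic_attack(sample: dict, adv_text: str,
                       modalities: tuple = ("visual", "audio")) -> dict:
    """Inject the same misleading text into the chosen modalities.

    sample: {"visual": ..., "audio": ..., "text": ...} (string stand-ins)
    adv_text: the typographic payload, e.g. a wrong class label.
    """
    attacked = dict(sample)
    if "visual" in modalities:
        # Visual channel: payload appears as overlaid on-frame text.
        attacked["visual"] = f'{sample["visual"]} [overlay text: "{adv_text}"]'
    if "audio" in modalities:
        # Audio channel: payload appears as spoken words in the track.
        attacked["audio"] = f'{sample["audio"]} [spoken: "{adv_text}"]'
    return attacked


benign = {
    "visual": "frame of a dog",
    "audio": "barking sounds",
    "text": "What animal is this?",
}

# Single-modality attack: only the visual channel carries the payload.
single = typographic_attack(benign, "this is a cat", modalities=("visual",))

# Coordinated attack: visual and audio channels agree on the false cue,
# which the paper finds far more potent (83.43% vs. 34.93% success rate).
coordinated = typographic_attack(benign, "this is a cat")
```

The key design point the sketch captures is that the coordinated variant presents a consistent false signal across channels, so the model cannot resolve the conflict by trusting one unperturbed modality.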