
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

Tianle Chen, Deepti Ghadiyaram


Published on arXiv: 2604.03995

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Coordinated cross-modal typographic attacks achieve 83.43% attack success rate compared to 34.93% for single-modality attacks on frontier audio-visual MLLMs

Multi-Modal Typography

Novel technique introduced


As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that coordinated multi-modal attacks create a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs. $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.


Key Contributions

  • First systematic study of cross-modal typographic attacks on audio-visual MLLMs
  • Demonstrates that coordinated multi-modal attacks achieve 83.43% success rate vs 34.93% for single-modality attacks
  • Evaluates attacks across multiple frontier MLLMs on common-sense reasoning and content moderation tasks
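The attack success rates above can be made concrete with a small sketch of how such a metric is typically computed: ASR is the fraction of samples the model answered correctly on clean inputs that the attack flips to an incorrect answer. This is a generic, hypothetical implementation, not the paper's code.

```python
# Hypothetical sketch: attack success rate (ASR) as the fraction of
# originally-correct predictions that an attack flips to a wrong answer.
# (Illustrative only; the paper's exact evaluation protocol may differ.)

def attack_success_rate(clean_preds, attacked_preds, labels):
    flipped = 0   # clean-correct samples the attack broke
    correct = 0   # samples the model got right without the attack
    for c, a, y in zip(clean_preds, attacked_preds, labels):
        if c == y:          # model was right on the clean input
            correct += 1
            if a != y:      # attack changed the answer to something wrong
                flipped += 1
    return flipped / correct if correct else 0.0

# Toy example: 4 clean-correct samples, 3 of them flipped -> ASR = 0.75
print(attack_success_rate(
    ["a", "b", "c", "d", "x"],   # clean predictions
    ["z", "b", "z", "z", "x"],   # predictions under attack
    ["a", "b", "c", "d", "y"],   # ground-truth labels
))
```

Samples the model already got wrong on clean inputs are excluded from the denominator, so ASR isolates the attack's effect from baseline errors.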

🛡️ Threat Analysis

Input Manipulation Attack

Typographic attacks are input manipulation attacks that craft adversarial text perturbations across audio and visual modalities to cause misclassification and bypass content moderation at inference time.
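To illustrate the coordination aspect, the sketch below shows one plausible structure for assembling a cross-modal typographic attack: the same misleading phrase is injected into each modality's input before inference. All names here are illustrative assumptions, not the paper's API, and the per-modality methods stand in for real operations (rendering text onto image pixels, mixing synthesized speech into audio, appending text to the prompt).

```python
# Hypothetical sketch of a coordinated cross-modal typographic attack.
# Each method is a placeholder for the real perturbation; in practice the
# phrase would be rasterized onto the image, spoken into the audio track,
# and appended to the text prompt.
from dataclasses import dataclass

@dataclass
class CrossModalAttack:
    phrase: str  # misleading text, e.g. a wrong answer or benign-sounding label

    def perturb_image(self, image_desc: str) -> str:
        # Real attack: render `phrase` as an overlay in the image pixels.
        return f"{image_desc} [overlay text: '{self.phrase}']"

    def perturb_audio(self, audio_desc: str) -> str:
        # Real attack: synthesize `phrase` as speech and mix it into the audio.
        return f"{audio_desc} [spoken insert: '{self.phrase}']"

    def perturb_prompt(self, prompt: str) -> str:
        # Real attack: append `phrase` to the user's text prompt.
        return f"{prompt} {self.phrase}"

attack = CrossModalAttack("The answer is: cat")
print(attack.perturb_image("photo of a dog"))
print(attack.perturb_audio("barking sounds"))
print(attack.perturb_prompt("What animal is this?"))
```

The key point the paper's numbers support is that applying all three perturbations together is far more effective than any single one in isolation.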


Details

Domains
multimodal, audio, vision, nlp
Model Types
multimodal, vlm, llm
Threat Tags
inference_time, targeted, digital
Applications
audio-visual reasoning, content moderation, common-sense reasoning