Defense · 2025

Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations

Jinjie Shen 1, Yaxiong Wang 1, Lechao Cheng 1, Nan Pu 2, Zhun Zhong 1

Published on arXiv: 2509.12653

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

RamDG achieves 2.06% higher detection accuracy on the SAMM dataset compared to state-of-the-art multimodal manipulation detection approaches.

RamDG (Retrieval-Augmented Manipulation Detection and Grounding)

Novel technique introduced


The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically coordinated manipulations, where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generating contextually plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs by our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate that our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
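The retrieval-augmented idea in the abstract can be sketched in miniature: fetch contextual evidence for an input caption from an external knowledge repository, then compare the caption against the best-matching evidence. The token-overlap retriever and the decision rule below are toy stand-ins for illustration only, not the paper's actual RamDG modules, and all names (`retrieve_evidence`, `detect`, `knowledge_base`) are hypothetical.

```python
# Toy sketch of retrieval-augmented manipulation detection.
# A real system would use dense retrieval and learned detection/grounding
# modules; here, evidence is ranked by word overlap and a caption is
# flagged when it mostly agrees with the evidence but swaps a detail.

def tokenize(text):
    """Lowercase bag-of-words tokenization."""
    return set(text.lower().split())

def retrieve_evidence(query, knowledge_base, k=1):
    """Rank repository entries by token overlap with the query (toy retriever)."""
    return sorted(
        knowledge_base,
        key=lambda doc: len(tokenize(doc) & tokenize(query)),
        reverse=True,
    )[:k]

def detect(caption, knowledge_base):
    """Flag captions that largely match retrieved evidence but differ in detail."""
    best = retrieve_evidence(caption, knowledge_base, k=1)[0]
    overlap = tokenize(caption) & tokenize(best)
    disagreement = tokenize(caption) - tokenize(best)
    return "suspect" if len(overlap) >= 2 and disagreement else "clean"
```

For example, against a repository containing "the president visited berlin on monday", the caption "the president visited paris on monday" shares most words with the evidence yet disagrees on the location, so it is flagged; a caption matching the evidence exactly is not.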


Key Contributions

  • First Semantic-Aligned Multimodal Manipulation (SAMM) dataset of 260,970 samples where image manipulations are paired with semantically consistent fabricated text narratives, better reflecting real-world attack patterns
  • Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework that uses external knowledge repositories to retrieve contextual evidence for detecting and localizing manipulations
  • Demonstrates that existing benchmarks with cross-modal semantic misalignment significantly underestimate real-world detection difficulty

🛡️ Threat Analysis

Output Integrity Attack

The paper's primary contribution is detecting AI-manipulated and deepfake content in multimodal settings (face swapping and attribute editing paired with fabricated text). It proposes a new benchmark dataset and a novel retrieval-augmented detection architecture, placing it squarely in the output integrity and content authenticity category.


Details

Domains
vision, nlp, multimodal
Model Types
multimodal, transformer, generative
Threat Tags
digital, inference_time
Datasets
SAMM (proposed), NewsCLIPings, DGM4
Applications
fake news detection, deepfake detection, multimodal media forensics