Defense · 2025

M4-BLIP: Advancing Multi-Modal Media Manipulation Detection through Face-Enhanced Local Analysis

Hang Wu , Ke Sun , Jiayi Ji , Xiaoshuai Sun , Rongrong Ji

0 citations · 35 references · arXiv

Published on arXiv: 2512.01214

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

M4-BLIP outperforms state-of-the-art multi-modal manipulation detectors in both quantitative and visualization experiments.

M4-BLIP

Novel technique introduced


In the contemporary digital landscape, multi-modal media manipulation has emerged as a significant societal threat, undermining the reliability and integrity of information dissemination. Current detection methodologies in this domain often overlook localized information, despite the fact that manipulations frequently occur in specific areas, particularly facial regions. In response to this observation, we propose the M4-BLIP framework. This framework uses the BLIP-2 model, known for its ability to extract local features, as the cornerstone for feature extraction, and incorporates local facial information as prior knowledge. A specially designed alignment and fusion module within M4-BLIP integrates these local and global features, enhancing detection accuracy. Furthermore, our approach integrates with Large Language Models (LLMs), significantly improving the interpretability of detection outcomes. Extensive quantitative and visualization experiments validate the effectiveness of our framework against state-of-the-art competitors.


Key Contributions

  • M4-BLIP framework leveraging BLIP-2 for local feature extraction with facial region priors for enhanced manipulation detection
  • Alignment and fusion module that integrates local (facial) and global multi-modal features
  • Integration with LLMs to improve interpretability and explainability of manipulation detection outcomes
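The alignment-and-fusion idea in the contributions above can be illustrated with a minimal sketch: a global multi-modal feature attends over local facial-region features via scaled dot-product cross-attention, and the attended local context is fused back into the global representation with a residual connection. All names, dimensions, and the specific fusion rule here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_local_global(global_feat, local_feats):
    """Hypothetical cross-attention fusion.

    global_feat: (d,) global image+text feature
    local_feats: (n, d) features for n facial regions
    """
    d = global_feat.shape[-1]
    scores = local_feats @ global_feat / np.sqrt(d)  # (n,) attention logits
    weights = softmax(scores)                        # attention over face regions
    attended = weights @ local_feats                 # (d,) weighted local context
    return global_feat + attended                    # residual fusion

rng = np.random.default_rng(0)
g = rng.standard_normal(16)             # global multi-modal feature
faces = rng.standard_normal((4, 16))    # 4 facial-region features
fused = fuse_local_global(g, faces)
print(fused.shape)  # (16,)
```

In a real system the fused vector would feed a binary real/fake classifier head; the sketch only shows how local facial priors can be weighted and merged into a global representation.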

🛡️ Threat Analysis

Output Integrity Attack

The paper proposes a novel forensic detection architecture for verifying the authenticity and integrity of multi-modal media (image+text manipulation, facial deepfakes), which falls squarely under output integrity and AI-generated/manipulated content detection. The framework introduces new detection methodology rather than merely applying existing detectors to a new domain.


Details

Domains
multimodal · vision · nlp
Model Types
vlm · llm · transformer
Threat Tags
inference_time · digital
Applications
multi-modal media manipulation detection · deepfake detection · misinformation detection