Defense · 2025

Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics

Tom Or, Omri Azencot



Published on arXiv: 2508.00784

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Linear classifiers trained on intermediate-layer features of large pre-trained multi-modal models match or surpass strong deepfake-detection baselines across image and audio modalities, with strong cross-generator generalization.

Multi-Modal Layer Analysis for Synthetic Content Detection

Novel technique introduced


Generative models achieve remarkable results in multiple data domains, including images and text. Unfortunately, malicious users exploit synthetic media to spread misinformation and disseminate deepfakes. Consequently, the need for robust and stable fake detectors is pressing, especially as new generative models appear every day. While the majority of existing work trains classifiers that discriminate between real and fake information, such tools typically generalize only within the same family of generators and data modalities, yielding poor results on other generative classes and data domains. Towards a universal classifier, we propose the use of large pre-trained multi-modal models for the detection of generative content. Effectively, we show that the latent code of these models naturally captures information discriminating real from fake. Building on this observation, we demonstrate that linear classifiers trained on these features can achieve state-of-the-art results across various modalities, while remaining computationally efficient, fast to train, and effective even in few-shot settings. Our work primarily focuses on fake detection in audio and images, achieving performance that surpasses or matches that of strong baseline methods.


Key Contributions

  • Extends the CLIP-ViT-based deepfake detection paradigm to a multi-modal setting covering both images and audio, leveraging latent representations of large pre-trained multi-modal models (e.g., CLIP-ViT, ImageBind, LanguageBind)
  • Proposes a layer analysis methodology showing that intermediate layers — not final or initial layers — provide optimal separability between real and synthetic content, motivating a 'sweet spot' hypothesis for forensic classification
  • Demonstrates that lightweight linear classifiers trained on these intermediate features achieve state-of-the-art cross-generator generalization in a computationally efficient, few-shot-capable framework
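The layer-analysis contribution above amounts to a simple recipe: train a linear probe on each layer's features and keep the layer with the best held-out accuracy. The sketch below illustrates that loop on synthetic Gaussian stand-in features; the per-layer separability values, dimensions, and the ridge-regularized least-squares probe are all illustrative assumptions, not the paper's actual models or data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 400  # feature dimension, samples per class

# Hypothetical per-layer class separability: peaks mid-network,
# mimicking the intermediate-layer 'sweet spot' observation.
seps = np.array([0.1, 0.2, 0.4, 0.7, 1.0, 1.2, 3.0, 1.2, 1.0, 0.7, 0.4, 0.2])

def make_split(sep):
    """Stand-in for one layer's features: real ~ N(0, I), fake shifted on one axis."""
    X_real = rng.normal(0.0, 1.0, (n, d))
    X_fake = rng.normal(0.0, 1.0, (n, d))
    X_fake[:, 0] += sep
    X = np.vstack([X_real, X_fake])
    y = np.r_[np.zeros(n), np.ones(n)]  # 0 = real, 1 = fake
    return X, y

def linear_probe_acc(sep):
    """Fit a linear probe on a train split, report accuracy on a fresh test split."""
    Xtr, ytr = make_split(sep)
    Xte, yte = make_split(sep)
    # Ridge-regularized least-squares probe with a bias column, +/-1 targets.
    A = np.hstack([Xtr, np.ones((len(Xtr), 1))])
    w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(d + 1), A.T @ (2 * ytr - 1))
    scores = np.hstack([Xte, np.ones((len(Xte), 1))]) @ w
    return np.mean((scores > 0) == (yte == 1))

accs = [linear_probe_acc(s) for s in seps]
best = int(np.argmax(accs))
print(f"best layer: {best}, accuracy: {accs[best]:.3f}")
```

In this toy setup the probe's accuracy tracks the separability of each layer's features, so the sweep selects the intermediate layer; with real activations the same sweep would reveal which depth best separates real from synthetic content.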

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is a novel deepfake detection method: a universal forensic classifier for AI-generated content (images and audio) built on latent representations from intermediate layers of large pre-trained multi-modal models. This directly addresses output integrity and synthetic content detection.


Details

Domains
vision, audio, multimodal
Model Types
transformer, VLM, multimodal
Threat Tags
inference_time
Datasets
FaceForensics++
Applications
deepfake detection, AI-generated image detection, audio deepfake detection, synthetic media forensics