Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics
Published on arXiv
2508.00784
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Linear classifiers trained on intermediate-layer features of large pre-trained multi-modal models match or surpass strong baselines for deepfake detection in both the image and audio modalities, with strong cross-generator generalization.
Multi-Modal Layer Analysis for Synthetic Content Detection
Novel technique introduced
Generative models achieve remarkable results in multiple data domains, including images and text, among other examples. Unfortunately, malicious users exploit synthetic media to spread misinformation and disseminate deepfakes. Consequently, the need for robust and stable fake detectors is pressing, especially when new generative models appear every day. While the majority of existing work trains classifiers that discriminate between real and fake information, such tools typically generalize only within the same family of generators and data modalities, yielding poor results on other generative classes and data domains. Towards a universal classifier, we propose the use of large pre-trained multi-modal models for the detection of generative content. Effectively, we show that the latent code of these models naturally captures information discriminating real from fake. Building on this observation, we demonstrate that linear classifiers trained on these features can achieve state-of-the-art results across various modalities, while remaining computationally efficient, fast to train, and effective even in few-shot settings. Our work primarily focuses on fake detection in audio and images, achieving performance that surpasses or matches that of strong baseline methods.
Key Contributions
- Extends the CLIP-ViT-based deepfake detection paradigm to a multi-modal setting covering both images and audio, leveraging latent representations of large pre-trained multi-modal models (e.g., CLIP-ViT, ImageBind, LanguageBind)
- Proposes a layer analysis methodology showing that intermediate layers — not final or initial layers — provide optimal separability between real and synthetic content, motivating a 'sweet spot' hypothesis for forensic classification
- Demonstrates that lightweight linear classifiers trained on these intermediate features achieve state-of-the-art cross-generator generalization in a computationally efficient, few-shot-capable framework
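The layer-analysis methodology in the second contribution can be illustrated with a toy sweep. The code below fabricates per-layer features for a hypothetical 12-layer encoder in which class separation peaks at an intermediate layer, fits a simple linear rule (nearest class mean) per layer, and selects the best-scoring layer. Everything here is synthetic and for illustration only; the paper runs this sweep on real hidden states of models like CLIP-ViT, ImageBind, and LanguageBind.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 12-layer encoder: class separation is constructed to peak
# at an intermediate layer, mimicking the reported "sweet spot" effect.
N_LAYERS, DIM, N = 12, 32, 400
PEAK = 6
separations = [1.5 * np.exp(-((l - PEAK) ** 2) / 8.0) for l in range(N_LAYERS)]

def layer_features(sep):
    """Synthetic real/fake features at one layer, shifted by `sep`."""
    real = rng.normal(0.0, 1.0, size=(N, DIM))
    fake = rng.normal(sep, 1.0, size=(N, DIM))
    X = np.vstack([real, fake])
    y = np.concatenate([np.zeros(N), np.ones(N)])
    return X, y

def probe_accuracy(X, y):
    """Fit a nearest-class-mean linear rule on half, score the other half."""
    idx = rng.permutation(len(y))
    tr, te = idx[: len(y) // 2], idx[len(y) // 2 :]
    w = X[tr][y[tr] == 1].mean(0) - X[tr][y[tr] == 0].mean(0)
    thresh = (X[tr] @ w).mean()
    return np.mean(((X[te] @ w) > thresh).astype(float) == y[te])

scores = []
for layer in range(N_LAYERS):
    X, y = layer_features(separations[layer])
    scores.append(probe_accuracy(X, y))
    print(f"layer {layer:2d}: probe accuracy = {scores[-1]:.3f}")

best = int(np.argmax(scores))
print(f"best ('sweet spot') layer: {best}")
```

The sweep makes the selection criterion concrete: because each probe is linear and cheap, scoring every layer is inexpensive, and the layer whose features separate the classes best becomes the forensic feature extractor.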
🛡️ Threat Analysis
The primary contribution is a novel deepfake detection method: a universal forensic classifier for AI-generated content (images and audio) built on latent representations from intermediate layers of large pre-trained multi-modal models. This directly addresses output integrity and synthetic content detection.