Defense · 2025

Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics

Tom Or, Omri Azencot



Published on arXiv: 2508.00784

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Linear classifiers trained on intermediate-layer features of large pre-trained multi-modal models match or surpass strong deepfake-detection baselines across image and audio modalities, with strong cross-generator generalization.

Multi-Modal Layer Analysis for Synthetic Content Detection

Novel technique introduced


Generative models achieve remarkable results in multiple data domains, including images and text. Unfortunately, malicious users exploit synthetic media to spread misinformation and disseminate deepfakes. Consequently, the need for robust and stable fake detectors is pressing, especially as new generative models appear every day. While the majority of existing work trains classifiers that discriminate between real and fake information, such tools typically generalize only within the same family of generators and data modalities, yielding poor results on other generative classes and data domains. Towards a universal classifier, we propose the use of large pre-trained multi-modal models for the detection of generative content. Effectively, we show that the latent code of these models naturally captures information discriminating real from fake. Building on this observation, we demonstrate that linear classifiers trained on these features can achieve state-of-the-art results across various modalities, while remaining computationally efficient, fast to train, and effective even in few-shot settings. Our work primarily focuses on fake detection in audio and images, achieving performance that surpasses or matches that of strong baseline methods.


Key Contributions

  • Extends the CLIP-ViT-based deepfake detection paradigm to a multi-modal setting covering both images and audio, leveraging latent representations of large pre-trained multi-modal models (e.g., CLIP-ViT, ImageBind, LanguageBind)
  • Proposes a layer analysis methodology showing that intermediate layers — not final or initial layers — provide optimal separability between real and synthetic content, motivating a 'sweet spot' hypothesis for forensic classification
  • Demonstrates that lightweight linear classifiers trained on these intermediate features achieve state-of-the-art cross-generator generalization in a computationally efficient, few-shot-capable framework
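The layer-analysis contribution above amounts to a simple recipe: train a linear probe on each layer's features and keep the layer with the best held-out accuracy. The sketch below illustrates that loop on synthetic Gaussian stand-in features; the per-layer separability values, dimensions, and the ridge-regularized least-squares probe are all illustrative assumptions, not the paper's actual models or data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 400  # feature dimension, samples per class

# Hypothetical per-layer class separability: peaks mid-network,
# mimicking the intermediate-layer 'sweet spot' observation.
seps = np.array([0.1, 0.2, 0.4, 0.7, 1.0, 1.2, 3.0, 1.2, 1.0, 0.7, 0.4, 0.2])

def make_split(sep):
    """Stand-in for one layer's features: real ~ N(0, I), fake shifted on one axis."""
    X_real = rng.normal(0.0, 1.0, (n, d))
    X_fake = rng.normal(0.0, 1.0, (n, d))
    X_fake[:, 0] += sep
    X = np.vstack([X_real, X_fake])
    y = np.r_[np.zeros(n), np.ones(n)]  # 0 = real, 1 = fake
    return X, y

def linear_probe_acc(sep):
    """Fit a linear probe on a train split, report accuracy on a fresh test split."""
    Xtr, ytr = make_split(sep)
    Xte, yte = make_split(sep)
    # Ridge-regularized least-squares probe with a bias column, +/-1 targets.
    A = np.hstack([Xtr, np.ones((len(Xtr), 1))])
    w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(d + 1), A.T @ (2 * ytr - 1))
    scores = np.hstack([Xte, np.ones((len(Xte), 1))]) @ w
    return np.mean((scores > 0) == (yte == 1))

accs = [linear_probe_acc(s) for s in seps]
best = int(np.argmax(accs))
print(f"best layer: {best}, accuracy: {accs[best]:.3f}")
```

In this toy setup the probe's accuracy tracks the separability of each layer's features, so the sweep selects the intermediate layer; with real activations the same sweep would reveal which depth best separates real from synthetic content.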

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is a novel deepfake detection method: a universal forensic classifier for AI-generated content (images and audio) built on latent representations from intermediate layers of large pre-trained multi-modal models. This directly addresses output integrity and synthetic content detection.


Details

Domains
vision, audio, multimodal
Model Types
transformer, VLM, multimodal
Threat Tags
inference_time
Datasets
FaceForensics++
Applications
deepfake detection, AI-generated image detection, audio deepfake detection, synthetic media forensics