Defense · 2025

CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

Bo Liu, Qiao Qin, Qinghui He

2 citations · 43 references · arXiv


Published on arXiv: 2512.13285

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

CausalCLIP improves cross-generator generalization by 6.83% in accuracy and 4.06% in average precision over SOTA by disentangling causal forensic cues from spurious patterns in CLIP features.

CausalCLIP

Novel technique introduced


The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
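The Gumbel-Softmax feature masking described above can be sketched in a few lines: a learnable pair of keep/drop logits per feature dimension is perturbed with Gumbel noise and pushed through a temperature-controlled softmax, yielding a differentiable soft mask that splits a CLIP embedding into candidate causal and non-causal parts. This is a minimal NumPy illustration under assumed shapes (a 512-dimensional feature vector, uniform initial logits), not the authors' implementation; all names here are hypothetical.

```python
import numpy as np

def gumbel_softmax_mask(logits, tau=0.5, rng=None):
    """Sample a soft binary mask over feature dimensions.

    logits: (d, 2) unnormalized log-probs for [drop, keep] per dimension.
    Returns a (d,) soft mask with entries strictly in (0, 1).
    """
    rng = np.random.default_rng(rng)
    # Gumbel(0, 1) noise via inverse transform sampling
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=1, keepdims=True))   # stable softmax
    y = y / y.sum(axis=1, keepdims=True)
    return y[:, 1]  # probability mass on "keep"

d = 512                                  # assumed CLIP feature width
logits = np.zeros((d, 2))                # uniform prior over keep/drop
features = np.random.default_rng(0).normal(size=d)

mask = gumbel_softmax_mask(logits, tau=0.5, rng=0)
causal = mask * features                 # candidate causal features
non_causal = (1.0 - mask) * features     # complementary split
```

In training, the logits would be learned jointly with the detector, and the temperature `tau` annealed so the mask approaches a hard binary selection while remaining differentiable.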


Key Contributions

  • CausalCLIP: a disentangle-then-filter framework that separates stable causal forensic features from spurious non-causal features in CLIP representations using a structural causal model
  • Gumbel-Softmax-based feature masking with HSIC statistical independence constraints to isolate distribution-shift-robust forensic cues
  • Achieves improvements of 6.83% in accuracy and 4.06% in average precision over SOTA when detecting images from unseen generative model families
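The HSIC constraint in the second contribution penalizes statistical dependence between the causal and non-causal feature groups. Its standard biased empirical estimator is the trace of the product of two centered kernel matrices; a minimal NumPy sketch with RBF kernels follows (the bandwidth `sigma` and sample shapes are illustrative assumptions, not values from the paper).

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian RBF kernel matrix for rows of X, shape (n, d) -> (n, n)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between paired samples X (n, dx) and Y (n, dy).

    Near zero for independent variables, larger for dependent ones.
    """
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = rng.normal(size=(200, 1))               # independent of x

h_dep = hsic(x, x)                          # maximal dependence
h_indep = hsic(x, y)                        # independent draws
```

Minimizing an HSIC term of this form between the masked causal and non-causal feature groups drives the two representations toward statistical independence, which is how the disentanglement is enforced during training.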

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is a novel forensic detection framework for AI-generated images. It directly addresses output integrity and content authenticity by distinguishing real from synthetic images across diverse GAN and diffusion model families.


Details

Domains
vision
Model Types
transformer, gan, diffusion
Threat Tags
inference_time
Datasets
ProGAN, StyleGAN2, StyleGAN3, ADM, Glide, Stable Diffusion
Applications
ai-generated image detection, deepfake detection, generated content forensics