defense 2026

Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

Haifeng Zhang 1, Qinghui He 1, Xiuli Bi 1, Bo Liu 1, Chi-Man Pun 2, Bin Xiao 1,3

0 citations


Published on arXiv

2604.12353

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Outperforms SOTA methods by 10.89% in accuracy and 8.57% in AP; achieves over 80% detection accuracy when trained with only 320 images

MAFL (Multi-dimensional Adversarial Feature Learning)

Novel technique introduced


In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Extensive experiments show that our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.
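The core idea, an encoder trained jointly with a real/fake head while an adversarial bias branch is prevented from recovering generator-specific signal, can be illustrated with gradient reversal on a toy linear model. This is a minimal sketch of the general adversarial-debiasing pattern, not the paper's actual MAFL architecture or multi-dimensional loss; the data, dimensions, and the weight `lam` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: dims 0-1 carry the real/fake signal; dims 2-3 carry a
# generator-specific "bias" signal that is spuriously correlated with
# the label in the training set (stand-in for generation-pattern bias).
N = 400
y = rng.integers(0, 2, N)                      # real (0) vs fake (1)
flip = rng.random(N) < 0.3
b = np.where(flip, 1 - y, y)                   # bias label, ~70% correlated with y
X = np.zeros((N, 4))
X[:, :2] = (2 * y[:, None] - 1) + 0.5 * rng.standard_normal((N, 2))
X[:, 2:] = (2 * b[:, None] - 1) + 0.5 * rng.standard_normal((N, 2))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Linear encoder plus two logistic heads:
# main head = authenticity (real/fake), bias head = generator identity.
W = 0.5 * rng.standard_normal((2, 4))          # encoder
u = 0.1 * rng.standard_normal(2)               # main head
v = 0.1 * rng.standard_normal(2)               # bias head
lr, lam = 0.5, 1.0                             # lam = adversarial weight (assumed)

for step in range(500):
    Z = X @ W.T                                # (N, 2) learned features
    p = sigmoid(Z @ u)                         # real/fake prediction
    q = sigmoid(Z @ v)                         # bias-branch prediction

    # Standard logistic-loss gradients for the two heads.
    g_u = Z.T @ (p - y) / N
    g_v = Z.T @ (q - b) / N

    # Encoder gradient: descend on the main loss but ASCEND on the bias
    # loss (gradient reversal), so features stop encoding generator signal.
    gZ = np.outer(p - y, u) - lam * np.outer(q - b, v)
    g_W = (gZ / N).T @ X

    u -= lr * g_u
    v -= lr * g_v                              # bias head itself still learns normally
    W -= lr * g_W

main_acc = np.mean((sigmoid(X @ W.T @ u) > 0.5) == y)
```

In this setup the bias head keeps trying to read the generator label out of the features while the reversed gradient pushes the encoder to erase exactly that information, leaving the authenticity head to rely on the signal shared across "generators" (dims 0-1), which is the intuition behind MAFL's adversarial branch.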


Key Contributions

  • Multi-dimensional Adversarial Feature Learning (MAFL) framework that uses adversarial training to suppress generative pattern and content biases
  • Adversarial bias-learning branch with multi-dimensional adversarial loss that guides the model to focus on shared generative features across different generator types
  • Achieves 10.89% accuracy improvement over SOTA with strong data efficiency (80%+ detection accuracy with only 320 training images)

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses output integrity and content authenticity by detecting AI-generated images (deepfakes) across different generative models (GANs, diffusion models). The paper's primary contribution is improving detection of synthetic content provenance and distinguishing real from AI-generated images.


Details

Domains
vision, generative
Model Types
gan, diffusion, multimodal
Threat Tags
inference_time, digital
Applications
ai-generated image detection, deepfake detection, image forensics