
Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

Haoyang Jiang 1,2, Mingyang Yi 1, Shaolei Zhang 1, Junxian Cai 2, Qingbin Liu 2, Xi Chen 2, Ju Fan 1


Published on arXiv (2604.12781)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Adversarial attacks degrade detection accuracy to near zero across three representative detectors and four generative backbone models with imperceptible perturbations (ε=8/255)

APGD on reconstruction-based detectors

Novel technique introduced


Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.


Key Contributions

  • Systematic evaluation showing reconstruction-based detectors for diffusion-generated images are vulnerable to adversarial perturbations with near-zero accuracy
  • Demonstration of transferability of adversarial attacks across different detectors and generative models in black-box settings
  • Analysis revealing standard defenses (diffusion purification, adversarial training) provide limited mitigation due to low SNR of attacked samples

🛡️ Threat Analysis

Input Manipulation Attack

The core contribution is gradient-based adversarial perturbations (crafted with APGD) that cause AI-generated image detectors to misclassify at inference time. The paper constructs white-box attacks using imperceptible perturbations that flip detector predictions, and demonstrates that these attacks transfer to black-box settings.
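A minimal sketch of this attack style, using plain PGD (the simpler precursor of APGD) in NumPy. The linear `grad_fn` stand-in detector is hypothetical, purely for illustration; in the paper's setting the gradient would come from a reconstruction-based detector's "AI-generated" score, and the perturbation budget matches the reported ε=8/255.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=8/255, alpha=2/255, steps=10):
    """Projected gradient descent against a detector.

    grad_fn(x_adv) returns the gradient of the detector's
    'AI-generated' score w.r.t. the input; we descend that score
    so a generated image is pushed toward a 'real' prediction.
    """
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv - alpha * np.sign(g)        # step against the score
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into L_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid pixel range
    return x_adv

# Toy stand-in "detector": a fixed linear score w @ x (assumption,
# not the paper's model) whose gradient is simply w.
rng = np.random.default_rng(0)
w = rng.normal(size=(16,))
x = rng.uniform(0.3, 0.7, size=(16,))
x_adv = pgd_attack(x, grad_fn=lambda z: w)
```

After the loop, `x_adv` stays within the ε-ball around `x` (imperceptible by construction) while the detector's score is strictly lowered. APGD differs mainly in using an adaptive step size and momentum, which makes the attack stronger without changing this basic structure.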

Output Integrity Attack

The target of the attack is AI-generated content detection systems (reconstruction-based detectors for diffusion-generated images). While the attack method falls under ML01, the paper's domain is fundamentally the subversion of output-integrity verification systems that authenticate whether images are AI-generated.


Details

Domains
vision, generative
Model Types
diffusion, cnn
Threat Tags
white_box, black_box, inference_time, digital
Datasets
DIRE, LaRE², AEROBLADE
Applications
ai-generated image detection, deepfake detection, diffusion model output verification