attack 2025

PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors

Sepehr Dehdashtian 1,2, Mashrur M. Morshed 1, Jacob H. Seidman 2, Gaurav Bharaj 1, Vishnu Naresh Boddeti 2

0 citations

Published on arXiv

2509.15551

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

PolyJuice-steered T2I models evade state-of-the-art synthetic image detectors with up to 84% success rate, and augmenting SID training data with PolyJuice examples improves detector performance by up to 30%.

PolyJuice

Novel technique introduced


Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves SIDs' effectiveness by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID's failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) compared to their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).
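The two steps in the abstract (estimate a shift direction offline from black-box labels, then steer every latent along it) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the summary does not say how the direction is computed, so a plain difference of class means is swapped in, and `estimate_direction`, `steer`, and `toy_detector` are hypothetical names. The toy detector also reads latents directly, standing in for "generate an image, then query the SID".

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Coordinate-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def estimate_direction(
    latents: Sequence[Vector],
    flagged_as_fake: Callable[[Vector], bool],
) -> Vector:
    """Unit vector from 'caught' latents towards 'evaded' latents.

    Only the detector's binary decision is used (black-box access).
    ASSUMPTION: difference of means is a stand-in estimator, not
    necessarily the procedure used in the paper.
    """
    evaded: List[Vector] = []
    caught: List[Vector] = []
    for z in latents:
        (caught if flagged_as_fake(z) else evaded).append(z)
    if not evaded or not caught:
        raise ValueError("need both evaded and caught samples")
    diff = [e - c for e, c in zip(mean_vector(evaded), mean_vector(caught))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction found")
    return [d / norm for d in diff]


def steer(z: Vector, direction: Vector, strength: float) -> Vector:
    """Shift a latent along the same direction, regardless of the image."""
    return [a + strength * d for a, d in zip(z, direction)]


def toy_detector(z: Vector) -> bool:
    """Hypothetical detector: flags a sample as fake when z[0] > 0."""
    return z[0] > 0.0


rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
direction = estimate_direction(latents, toy_detector)
steered = steer([1.0, 0.0, 0.0, 0.0], direction, strength=5.0)
```

In this toy setup the estimated direction points almost entirely along the negative first axis, so a latent the detector would have flagged is pushed into the region it misses. The direction is computed once and reused for every sample, which is the sense in which the attack is image-agnostic.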


Key Contributions

  • PolyJuice: the first black-box, image-agnostic (universal) red-teaming method for synthetic image detectors, requiring only query access to the SID
  • Identifies a distribution shift in the T2I latent space between correctly and incorrectly classified samples, and uses this direction to universally steer T2I generation toward SID failure modes via lightweight offline optimization
  • Shows that fine-tuning SID models on PolyJuice-augmented datasets improves detector performance by up to 30%, enabling a defensive use of the attack
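The abstract also notes that steering directions estimated at a lower resolution can be transferred to a higher one by simple interpolation. The sketch below shows one way this could look for a single channel of a spatial direction map. The choice of bilinear interpolation with corner alignment is an assumption (the summary only says "simple interpolation"), and `bilinear_resize` is a hypothetical helper, not an API from the paper.

```python
from typing import List

Grid = List[List[float]]


def bilinear_resize(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Resize a 2D grid with bilinear interpolation (corners aligned).

    ASSUMPTION: stands in for the unspecified 'simple interpolation'
    used to move a low-resolution direction map to a larger latent.
    """
    in_h, in_w = len(grid), len(grid[0])
    out: Grid = []
    for i in range(out_h):
        # Map the output row index back to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row: List[float] = []
        for j in range(out_w):
            # Same mapping for the column index.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A 2x2 direction map (one channel) upsampled to 3x3.
low_res_direction = [[0.0, 1.0], [2.0, 3.0]]
high_res_direction = bilinear_resize(low_res_direction, 3, 3)
```

A real latent would have several channels, each resized the same way. Whether the upsampled direction should be renormalized afterwards is not stated in this summary and is left open here.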

🛡️ Threat Analysis

Output Integrity Attack

Synthetic image detectors (SIDs) are ML09 content integrity systems. PolyJuice is an attack that defeats these detectors by exploiting distribution shifts in the T2I latent space to produce AI-generated images that evade detection — directly attacking content authenticity/provenance verification. ML09 explicitly covers both AI-generated content detection systems and attacks that defeat such protections.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
black_box, inference_time, untargeted
Applications
synthetic image detection, ai-generated content detection, deepfake detection