attack 2025

PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors

Sepehr Dehdashtian ¹,², Mashrur M. Morshed ¹, Jacob H. Seidman ², Gaurav Bharaj ¹, Vishnu Naresh Boddeti ²

¹ Michigan State University

² Reality Defender

Published on arXiv

2509.15551

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

PolyJuice-steered T2I models evade state-of-the-art synthetic image detectors with up to 84% success rate, and augmenting SID training data with PolyJuice examples improves detector performance by up to 30%.

Novel technique introduced

PolyJuice

Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves SIDs' effectiveness by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID's failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) than their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).
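The resolution-transfer step mentioned in the abstract can be illustrated with a minimal sketch. The abstract only says "simple interpolation", so the bilinear scheme, the (C, H, W) latent layout, the function name `upsample_direction`, and the final renormalization are all assumptions made here for illustration, not details confirmed by the paper.

```python
import numpy as np

def upsample_direction(direction: np.ndarray, out_hw: tuple) -> np.ndarray:
    """Bilinearly resize a (C, H, W) steering direction to (C, out_h, out_w).

    Hypothetical helper: the layout and the unit-norm rescaling are assumptions.
    """
    c, h, w = direction.shape
    out_h, out_w = out_hw
    # Sample positions in the source grid (corners of both grids are aligned).
    ys = np.linspace(0.0, h - 1, out_h)
    xs = np.linspace(0.0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]  # vertical blend weights
    wx = (xs - x0)[None, None, :]  # horizontal blend weights
    rows_top = direction[:, y0]
    rows_bottom = direction[:, y1]
    top = rows_top[:, :, x0] * (1 - wx) + rows_top[:, :, x1] * wx
    bottom = rows_bottom[:, :, x0] * (1 - wx) + rows_bottom[:, :, x1] * wx
    out = top * (1 - wy) + bottom * wy
    norm = np.linalg.norm(out)
    # Rescale to unit length so one steering strength can be reused (assumption).
    return out / norm if norm > 0 else out

# Usage: a direction estimated on an 8x8 latent grid, reused on a 16x16 grid.
low_res = np.random.default_rng(0).normal(size=(4, 8, 8))
high_res = upsample_direction(low_res, (16, 16))
```

Estimating the direction once on the small grid and resizing it avoids repeating the detector queries at the larger resolution, which is the overhead reduction the abstract describes.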

Key Contributions

PolyJuice: the first black-box, image-agnostic (universal) red-teaming method for synthetic image detectors, requiring only query access to the SID
Identifies a distribution shift in the T2I latent space between correctly and incorrectly classified samples, and uses the direction of this shift, found through a lightweight offline process, to universally steer T2I generation toward SID failure modes
Shows that fine-tuning SID models on PolyJuice-augmented datasets improves detector performance by up to 30%, enabling a defensive use of the attack
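The estimate-then-steer idea behind these contributions can be sketched on toy data. This summary does not say how the paper estimates the shift direction, so the difference-of-means estimator below is a swapped-in stand-in. The helper names, the 16-dimensional latents, and `toy_detector` (which scores latents directly instead of generated images) are likewise hypothetical.

```python
import numpy as np

def estimate_direction(latents, flagged):
    """Unit vector from the mean of flagged latents to the mean of evading ones.

    Difference of means is an assumed stand-in, not the paper's estimator.
    Only the detector's decisions are used, so black-box access suffices.
    """
    latents = np.asarray(latents, dtype=float)
    flagged = np.asarray(flagged, dtype=bool)
    if flagged.all() or not flagged.any():
        raise ValueError("need both flagged and unflagged samples")
    delta = latents[~flagged].mean(axis=0) - latents[flagged].mean(axis=0)
    return delta / np.linalg.norm(delta)

def steer(latents, direction, strength):
    """Shift every latent by the same vector (image-agnostic steering)."""
    return np.asarray(latents, dtype=float) + strength * direction

# Toy setup: the detector's decision depends on a hidden linear score.
rng = np.random.default_rng(0)
dim = 16
hidden_w = rng.normal(size=dim)
hidden_w /= np.linalg.norm(hidden_w)

def toy_detector(z):
    """Hypothetical black-box SID: True where a sample is flagged as synthetic."""
    return z @ hidden_w < 1.0

# Offline phase: query the detector once on a batch of sampled latents.
z_train = rng.normal(size=(2000, dim))
direction = estimate_direction(z_train, toy_detector(z_train))

# Attack phase: reuse the same direction on fresh latents, with no new queries.
z_test = rng.normal(size=(2000, dim))
evade_before = 1.0 - toy_detector(z_test).mean()
evade_after = 1.0 - toy_detector(steer(z_test, direction, strength=2.0)).mean()
```

In this toy setting the evasion rate on the held-out latents rises after steering. The numbers say nothing about real detectors; the sketch only shows why a single offline direction can act as a universal attack.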

🛡️ Threat Analysis

Output Integrity Attack

Synthetic image detectors (SIDs) are content integrity systems that fall under ML09. PolyJuice defeats these detectors by exploiting a distribution shift in the T2I latent space to produce AI-generated images that evade detection, directly attacking content authenticity and provenance verification. ML09 explicitly covers both AI-generated content detection systems and attacks that defeat such protections.

Details

Domains

vision, generative

Model Types

diffusion

Threat Tags

black_box, inference_time, untargeted

Applications

synthetic image detection, ai-generated content detection, deepfake detection
