
Naïve Exposure of Generative AI Capabilities Undermines Deepfake Detection

Sunpill Kim, Chanwoo Hwang, Minsu Kim, Jae Hong Seo



Published on arXiv (2603.10504)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Commercial generative AI chatbots enable non-expert adversaries to refine deepfakes that evade state-of-the-art detectors while preserving identity and substantially improving perceptual quality, with commercial services posing greater risk than open-source models.

Semantic-Preserving Image Refinement via Generative AI Chatbots

Novel technique introduced


Generative AI systems increasingly expose powerful reasoning and image refinement capabilities through user-facing chatbot interfaces. In this work, we show that the naïve exposure of such capabilities fundamentally undermines modern deepfake detectors. Rather than proposing a new image manipulation technique, we study a realistic and already-deployed usage scenario in which an adversary uses only benign, policy-compliant prompts and commercial generative AI systems. We demonstrate that state-of-the-art deepfake detection methods fail under semantic-preserving image refinement. Specifically, we show that generative AI systems articulate explicit authenticity criteria and inadvertently externalize them through unrestricted reasoning, enabling their direct reuse as refinement objectives. As a result, refined images simultaneously evade detection, preserve identity as verified by commercial face recognition APIs, and exhibit substantially higher perceptual quality. Importantly, we find that widely accessible commercial chatbot services pose a significantly greater security risk than open-source models, as their superior realism, semantic controllability, and low-barrier interfaces enable effective evasion by non-expert users. Our findings reveal a structural mismatch between the threat models assumed by current detection frameworks and the actual capabilities of real-world generative AI. While detection baselines are largely shaped by prior benchmarks, deployed systems expose unrestricted authenticity reasoning and refinement despite stringent safety controls in other domains.


Key Contributions

  • Demonstrates that commercial generative AI chatbots expose authenticity reasoning that can be directly reused as refinement objectives to evade deepfake detectors using only policy-compliant prompts
  • Shows refined deepfakes simultaneously evade state-of-the-art detectors, preserve identity (verified by commercial face recognition APIs), and improve perceptual quality
  • Reveals a structural mismatch between threat models assumed by current detection frameworks and the actual capabilities naively exposed by deployed commercial AI systems
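The attack loop implied by these contributions — elicit the chatbot's externalized authenticity criteria, reuse them as refinement objectives, and iterate until the detector is evaded while identity is preserved — can be sketched as follows. This is a minimal illustration with stand-in stubs; the chatbot reply, detector scoring, and refinement step are all hypothetical placeholders, not the paper's actual implementation or any real API.

```python
# Hypothetical sketch of the semantic-preserving refinement loop.
# All components (chatbot reply, detector score, refinement) are stubs.

def elicit_authenticity_criteria(chatbot_reply: str) -> list[str]:
    """Parse authenticity criteria the chatbot externalizes as a bullet list."""
    return [line.strip().lstrip("- ").strip()
            for line in chatbot_reply.splitlines()
            if line.strip().startswith("-")]

def refine(image: dict, criteria: list[str]) -> dict:
    """Stub: assume each criterion-guided edit lowers the detector's fake score."""
    refined = dict(image)
    refined["fake_score"] = max(0.0, image["fake_score"] - 0.15 * len(criteria))
    return refined

def evades_detector(image: dict, threshold: float = 0.5) -> bool:
    """Stub detector decision: flag as fake when score >= threshold."""
    return image["fake_score"] < threshold

# Benign, policy-compliant prompt yields explicit authenticity criteria.
reply = "- consistent skin texture\n- natural lighting\n- plausible reflections"
criteria = elicit_authenticity_criteria(reply)

fake = {"fake_score": 0.9, "identity": "subject_A"}
refined = fake
while not evades_detector(refined):
    refined = refine(refined, criteria)

# Identity is carried through refinement (the paper verifies this with
# commercial face recognition APIs; here it is just a preserved field).
assert refined["identity"] == fake["identity"]
```

The loop terminates once the stub detector's fake score drops below the threshold, mirroring the paper's finding that criterion-guided refinement drives detection failure without altering the subject's identity.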

🛡️ Threat Analysis

Output Integrity Attack

The paper attacks deepfake detection systems — a canonical ML09 content integrity use case — by refining AI-generated images to evade detectors while preserving identity. The contribution is demonstrating that generative AI chatbots undermine content authentication pipelines, which is squarely about output integrity and deepfake detection evasion.


Details

Domains
vision, generative, nlp

Model Types
vlm, diffusion, llm

Threat Tags
black_box, inference_time, targeted, digital

Applications
deepfake detection, facial recognition, content authentication