PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting
Hohyun Na, Seunghoo Hong, Simon S. Woo
Published on arXiv (2508.16217)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves state-of-the-art protection against diffusion-based inpainting across multiple metrics on EditBench while significantly reducing GPU memory usage compared to prior image immunization methods.
PromptFlare
Novel technique introduced
The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users' intentions, raising concerns about potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches rely heavily on image-level inconsistencies, a fundamental limitation when it comes to countering the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit intrinsic properties of prompt embeddings. Specifically, we identify and target a shared token across prompts that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model's focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulation. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
Key Contributions
- Identifies prompt-invariant shared tokens (e.g., padding/uninformative tokens) in diffusion model cross-attention that can be universally targeted regardless of the user's text prompt
- Injects adversarial noise that acts as a cross-attention decoy, suppressing meaningful prompt-image alignment in diffusion-based inpainting models
- Achieves state-of-the-art image protection on EditBench with reduced GPU memory and computational overhead compared to prior adversarial image protection methods
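The decoy mechanism above can be sketched in a deliberately simplified form. The following is a hypothetical numpy toy, not the paper's implementation: PromptFlare actually optimizes pixel-space perturbations through a full diffusion inpainting model, whereas here a single image feature (query) attends over a few prompt token embeddings (keys), and a bounded sign-gradient perturbation on the query is optimized so that attention mass collapses onto a shared, semantically uninformative "padding" token. All names, dimensions, and the step/budget values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                              # embedding dimension (illustrative)
tokens = rng.normal(size=(5, d))    # 4 "content" tokens + 1 shared/pad token
PAD = 4                             # index of the shared (decoy) token
query = rng.normal(size=d)          # stand-in for an image feature

def attention(q, K):
    """Scaled dot-product attention weights of query q over token keys K."""
    logits = K @ q / np.sqrt(d)
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

# PGD-style ascent on the decoy token's attention weight w.r.t. the query
# (a stand-in for perturbing pixels), with an L_inf-style clip as budget.
delta = np.zeros(d)
for _ in range(100):
    w = attention(query + delta, tokens)
    # softmax gradient: d w[PAD] / d q = w[PAD] * (k_PAD - sum_j w_j k_j) / sqrt(d)
    grad = w[PAD] * (tokens[PAD] - w @ tokens) / np.sqrt(d)
    delta = np.clip(delta + 0.5 * np.sign(grad), -4.0, 4.0)

before = attention(query, tokens)[PAD]
after = attention(query + delta, tokens)[PAD]
print(f"decoy attention before: {before:.3f}, after: {after:.3f}")
```

The point of the sketch is the objective, not the model: once attention concentrates on the uninformative decoy token, the weighted sum over token values carries almost no prompt semantics, which is the intuition behind neutralizing the prompt's influence regardless of its content.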
🛡️ Threat Analysis
PromptFlare is an image protection scheme that adds adversarial perturbations to images to prevent unauthorized AI-generated modifications — directly analogous to anti-deepfake perturbations and style-transfer protections listed under ML09. The primary goal is content integrity: stopping malicious actors from using diffusion inpainting to alter protected images.