Towards Transferable Defense Against Malicious Image Edits
Jie Zhang 1,2, Shuai Dong 3, Shiguang Shan 1,2, Xilin Chen 1,2
Published on arXiv: 2512.14341
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
TDAE achieves state-of-the-art performance in mitigating malicious diffusion-based image edits under both intra-model and cross-model (transferability) evaluations, outperforming prior immunization methods.
TDAE (FlatGrad Defense Mechanism + Dynamic Prompt Defense)
Novel technique introduced
Recent approaches that embed imperceptible perturbations in input images have shown promise in countering malicious manipulations by diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that strengthens image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce the FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering perturbations toward flat minima, FDM strengthens immunization robustness against unseen editing models. For textual defense, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, and then updates the images under the optimized embeddings. Through iterative adversarial updates against diverse embeddings, DPD drives the immunized images to acquire a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experiments demonstrate that TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.
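The flat-minima intuition behind FDM can be illustrated on a toy 1-D landscape. Everything below — the landscape, the step sizes, and the neighborhood-averaged gradient-norm penalty — is an illustrative assumption, not the paper's implementation: a sharp optimum of the adversarial objective scores high in-model but is penalized by nearby gradients, while a flatter optimum survives the regularizer.

```python
import numpy as np

# Toy 1-D "attack loss" landscape: a sharp, narrow peak and a lower,
# flat plateau (both illustrative, not from the paper).
def attack_loss(d):
    sharp = 1.2 * np.exp(-(d - 1.0) ** 2 / 0.01)   # strong but brittle optimum
    flat = 1.0 * np.exp(-(d - 3.0) ** 2 / 2.0)     # weaker but wide optimum
    return sharp + flat

def grad(d, eps=1e-4):
    # Central-difference gradient of the attack loss.
    return (attack_loss(d + eps) - attack_loss(d - eps)) / (2 * eps)

def fdm_score(d, lam=0.1, radius=0.1):
    # FDM-style objective: attack strength minus a gradient-norm penalty
    # averaged over a small neighborhood. Large nearby gradients signal a
    # sharp optimum, which tends not to transfer to unseen editors.
    penalty = np.mean([abs(grad(d + o)) for o in (-radius, 0.0, radius)])
    return attack_loss(d) - lam * penalty
```

On this landscape the sharp peak wins on raw attack loss (`attack_loss(1.0) > attack_loss(3.0)`), but the flat plateau wins under the regularized objective (`fdm_score(3.0) > fdm_score(1.0)`) — the same trade the paper argues buys cross-model transferability.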
Key Contributions
- FlatGrad Defense Mechanism (FDM): gradient regularization that steers adversarial perturbations toward flat minima, improving transferability to unseen editing models
- Dynamic Prompt Defense (DPD): adversarial optimization of text embeddings that periodically refines prompt representations to broaden immunity-enhancing features across diverse editing contexts
- TDAE bimodal framework combining FDM and DPD achieves state-of-the-art cross-model transferability in blocking malicious diffusion-based image edits
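The DPD alternation described above — refine the text embedding toward the defense's worst case, then re-optimize the image perturbation under that embedding — can be sketched as a toy min-max loop. The linear "editor", the numerical gradients, and the L2 budget are all hypothetical stand-ins for the paper's diffusion pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))        # hypothetical editor weights (toy stand-in)
x = rng.normal(size=4)             # "original image" as a 4-d vector

def edit(img, emb):
    # Toy stand-in for a diffusion editor conditioned on a prompt embedding.
    return np.tanh(W @ img) * emb

def gap(delta, emb):
    # Immunization objective: the edit of the protected image should
    # diverge from the edit of the original image.
    return float(np.linalg.norm(edit(x + delta, emb) - edit(x, emb)) ** 2)

def num_grad(f, v, eps=1e-5):
    # Central-difference gradient, coordinate by coordinate.
    g = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return g

delta = 0.05 * rng.normal(size=4)  # small initial perturbation
emb = np.ones(4) / 2.0             # unit-norm initial prompt embedding
budget = 0.3                       # assumed L2 perturbation budget

for _ in range(5):
    # (1) DPD inner step: refine the embedding to SHRINK the gap, i.e. find
    #     the prompt under which the current immunization is weakest.
    for _ in range(10):
        emb = emb - 0.1 * num_grad(lambda e: gap(delta, e), emb)
        emb /= np.linalg.norm(emb)  # keep the embedding on the unit sphere
    # (2) Update the perturbation to restore immunity under that embedding.
    for _ in range(10):
        delta = delta + 0.1 * num_grad(lambda d: gap(d, emb), delta)
        n = np.linalg.norm(delta)
        if n > budget:
            delta *= budget / n     # project back onto the budget
```

The point of the alternation is that `delta` is never tuned against a single frozen prompt: each round it must re-establish a nonzero gap under a freshly adversarial embedding, which is the mechanism the paper credits for transferability.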
🛡️ Threat Analysis
The paper embeds protective adversarial perturbations in images to prevent malicious AI editing, directly implementing the 'anti-deepfake perturbations / style-transfer protections' category of content-integrity defenses that ML09 explicitly references. The primary contribution is a content-protection scheme that degrades diffusion-model edits of protected images so that outputs cannot be maliciously manipulated, i.e., it maintains output/content integrity.