
Immunizing Images from Text to Image Editing via Adversarial Cross-Attention

Matteo Trippodo 1, Federico Becattini 2, Lorenzo Seidenari 1


Published on arXiv (2509.10359)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Attention Attack significantly degrades text-guided image editing performance on TEDBench++ while remaining imperceptible, without requiring knowledge of the editing method or target edit prompt.

Attention Attack

Novel technique introduced


Recent advances in text-based image editing have enabled fine-grained manipulation of visual content guided by natural language. However, such methods are susceptible to adversarial attacks. In this work, we propose a novel attack that targets the visual component of editing methods. We introduce Attention Attack, which disrupts the cross-attention between a textual prompt and the visual representation of the image by using an automatically generated caption of the source image as a proxy for the edit prompt. This breaks the alignment between the contents of the image and their textual description, without requiring knowledge of the editing method or the editing prompt. Reflecting on the reliability of existing metrics for immunization success, we propose two novel evaluation strategies: Caption Similarity, which quantifies semantic consistency between original and adversarial edits, and semantic Intersection over Union (IoU), which measures spatial layout disruption via segmentation masks. Experiments conducted on the TEDBench++ benchmark demonstrate that our attack significantly degrades editing performance while remaining imperceptible.


Key Contributions

  • Attention Attack: a prompt-agnostic adversarial perturbation method that uses an auto-generated image caption (via LLaVA) as a surrogate edit prompt to disrupt cross-attention between text and visual features in diffusion-based editors
  • Two novel evaluation strategies for immunization success: Caption Similarity (semantic consistency of edits) and semantic IoU (spatial layout disruption via segmentation masks)
  • Benchmark evaluation on TEDBench++ showing significant degradation of text-guided editing while maintaining imperceptibility
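The semantic IoU metric from the contributions above can be illustrated with a small sketch: compare segmentation masks of the clean edit and the adversarial edit class by class, so a low mean IoU signals that the perturbation disrupted the spatial layout. The function name, the background handling, and the averaging scheme here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def semantic_iou(mask_a, mask_b, ignore_background=True):
    """Mean per-class IoU between two integer segmentation masks.

    A low score means the adversarial edit changed the spatial layout
    (objects moved, vanished, or appeared). Treats class 0 as background
    by default; both choices are illustrative, not the paper's.
    """
    classes = np.union1d(np.unique(mask_a), np.unique(mask_b))
    if ignore_background:
        classes = classes[classes != 0]
    ious = []
    for c in classes:
        inter = np.logical_and(mask_a == c, mask_b == c).sum()
        union = np.logical_or(mask_a == c, mask_b == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 1.0

# Identical layouts score 1.0; a class that vanishes drags the mean down.
m1 = np.array([[0, 1], [1, 2]])
m2 = np.array([[0, 1], [1, 0]])   # class 2 disappeared in the edit
```

In this toy example `semantic_iou(m1, m1)` is 1.0, while `semantic_iou(m1, m2)` drops to 0.5 because class 2 has no overlap at all.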

🛡️ Threat Analysis

Input Manipulation Attack

Proposes gradient-based adversarial noise crafted to maximally disrupt the cross-attention mechanism in text-to-image editing diffusion models at inference time, causing the editing pipeline to fail. This is a canonical input manipulation attack — imperceptible perturbations are added to inputs to corrupt model behavior, with the twist that the goal is protective immunization rather than misclassification.
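The gradient-based crafting described above can be sketched on a toy model: a PGD-style loop that minimizes the alignment between the perturbed image's cross-attention and the clean attention, under an L-infinity budget so the change stays imperceptible. Everything here (dimensions, random projections, the alignment loss, and the PGD hyperparameters) is an illustrative stand-in for a diffusion editor, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for cross-attention: K caption tokens attend to a
# flattened "image" vector x through fixed random projections.
D, K = 32, 5                      # image dim, caption-token count (toy)
Wq = rng.normal(size=(K, D))      # caption-token queries (proxy prompt)
Wk = rng.normal(size=(D, D))      # key projection of the image

def attention(x):
    """Softmax cross-attention weights of caption tokens over the image."""
    logits = Wq @ (Wk @ x) / np.sqrt(D)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_alignment(x, a_clean):
    """Analytic gradient of <A(x), a_clean> w.r.t. x via the softmax Jacobian."""
    a = attention(x)
    # d<a, a_clean>/dlogits = (diag(a) - a a^T) @ a_clean
    g_logits = a * a_clean - a * (a @ a_clean)
    return Wk.T @ (Wq.T @ g_logits) / np.sqrt(D)

x_clean = rng.normal(size=D)
a_clean = attention(x_clean)

# PGD: sign-gradient descent on the alignment, projected back onto an
# L_inf ball of radius eps around the clean image at every step.
eps, step, iters = 0.05, 0.01, 100
x_adv = x_clean.copy()
for _ in range(iters):
    x_adv -= step * np.sign(grad_alignment(x_adv, a_clean))
    x_adv = x_clean + np.clip(x_adv - x_clean, -eps, eps)

align_before = float(attention(x_clean) @ a_clean)
align_after = float(attention(x_adv) @ a_clean)
```

After the loop, `align_after` is lower than `align_before` while the perturbation stays within the eps budget, mirroring the attack's goal: break text-image attention alignment while remaining visually negligible.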


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
white_box, inference_time, targeted, digital
Datasets
TEDBench++
Applications
text-based image editing, image protection/immunization