
ReVision : A Post-Hoc, Vision-Based Technique for Replacing Unacceptable Concepts in Image Generation Pipeline

Gurjot Singh 1, Prabhjot Singh 2, Aashima Sharma 3, Maninder Singh 3, Ryan Ko 4

arXiv (Cornell University)

Published on arXiv

2602.19149

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Reduces policy-violating content recognizability from 95.99% to 10.16% in a human moderation study, achieves near-complete suppression on the NudeNet detector (70.51 → 0), and improves multi-concept background fidelity (LPIPS 0.166 → 0.058).

ReVision

Novel technique introduced


Image-generative models are widely deployed across industries. Recent studies show that they can be exploited to produce policy-violating content. Existing mitigation strategies operate primarily at the pre- or mid-generation stages, through techniques such as prompt filtering and safety-aware training/fine-tuning. Prior work shows that these approaches can be bypassed and often degrade generative quality. In this work, we propose ReVision, a training-free, prompt-based, post-hoc safety framework for image-generation pipelines. ReVision acts as a last-line defense by analyzing generated images and selectively editing unsafe concepts without altering the underlying generator. It uses the Gemini-2.5-Flash model as a generic policy-violating concept detector, avoiding reliance on multiple category-specific detectors, and performs localized semantic editing to replace unsafe content. Prior post-hoc editing methods often rely on imprecise spatial localization, which undermines usability and limits deployability, particularly in multi-concept scenes. To address this limitation, ReVision introduces a VLM-assisted spatial gating mechanism that enforces instance-consistent localization, enabling precise edits while preserving scene integrity. We evaluate ReVision on a 245-image benchmark covering both single- and multi-concept scenarios. Results show that ReVision (i) improves CLIP-based alignment toward safe prompts by +$0.121$ on average; (ii) significantly improves multi-concept background fidelity (LPIPS $0.166 \rightarrow 0.058$); (iii) achieves near-complete suppression on category-specific detectors (e.g., NudeNet $70.51 \rightarrow 0$); and (iv) reduces policy-violating content recognizability in a human moderation study from $95.99\%$ to $10.16\%$.


Key Contributions

  • Training-free, post-hoc safety framework (ReVision) that edits policy-violating concepts in generated images without modifying the underlying generator
  • VLM-assisted spatial gating mechanism using Gemini-2.5-Flash for instance-consistent localization of unsafe content in multi-concept scenes
  • Evaluation benchmark of 245 images covering single- and multi-concept policy violation scenarios across categories including nudity, violence, and substance abuse
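The contributions above amount to a detect → gate → edit loop over generated images. A minimal sketch of that loop, in Python, with every name hypothetical: `detect`, `segment`, and `edit` stand in for the Gemini-2.5-Flash concept detector, an instance segmenter, and the localized semantic editor, respectively; this is an illustration of the pipeline shape, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    concept: str       # policy category the VLM reports, e.g. "nudity"
    box: tuple         # (x0, y0, x1, y1) region the VLM flags
    replacement: str   # safe concept to edit in, e.g. "swimsuit"

def gate_mask(instance_mask, box):
    """Instance-consistent spatial gating (hypothetical form): keep only
    the segmented instance's pixels that fall inside the VLM-reported box,
    so the edit stays on the flagged object and spares the rest of the scene."""
    x0, y0, x1, y1 = box
    return [[instance_mask[y][x] if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(len(instance_mask[0]))]
            for y in range(len(instance_mask))]

def revision_pipeline(image, detect, segment, edit):
    """Post-hoc loop: detect unsafe concepts, gate each instance mask,
    then apply a localized semantic edit per flagged region."""
    for det in detect(image):                       # VLM as generic detector
        mask = gate_mask(segment(image, det), det.box)
        image = edit(image, mask, det.replacement)  # replace unsafe concept
    return image
```

The gating step is where this sketch differs from coarser post-hoc editors: by intersecting the instance mask with the detector's box, edits are confined to one object even when several concepts share the scene, which is the failure mode the paper's multi-concept LPIPS numbers target.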

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses output integrity of generative AI models: detects policy-violating AI-generated content (nudity, violence, etc.) and performs localized semantic editing to replace unsafe outputs, acting as a last-line defense in the generation pipeline. The system is explicitly motivated by adversarial bypasses of upstream safety mechanisms that allow unsafe content to reach the final output.


Details

Domains
vision, generative
Model Types
diffusion, vlm
Threat Tags
inference_time, black_box
Datasets
custom 245-image benchmark (single- and multi-concept), NudeNet
Applications
text-to-image generation, image-to-image generation, content moderation pipeline