
ReVision : A Post-Hoc, Vision-Based Technique for Replacing Unacceptable Concepts in Image Generation Pipeline

Gurjot Singh 1, Prabhjot Singh 2, Aashima Sharma 3, Maninder Singh 3, Ryan Ko 4

arXiv (Cornell University)

Published on arXiv

2602.19149

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Reduces policy-violating content recognizability from 95.99% to 10.16% in a human moderation study, achieves near-complete suppression on the NudeNet detector (70.51 → 0), and improves multi-concept background fidelity (LPIPS 0.166 → 0.058).

ReVision

Novel technique introduced


Image-generative models are widely deployed across industries. Recent studies show that they can be exploited to produce policy-violating content. Existing mitigation strategies operate primarily at the pre- or mid-generation stages, through techniques such as prompt filtering and safety-aware training/fine-tuning. Prior work shows that these approaches can be bypassed and often degrade generative quality. In this work, we propose ReVision, a training-free, prompt-based, post-hoc safety framework for image-generation pipelines. ReVision acts as a last-line defense by analyzing generated images and selectively editing unsafe concepts without altering the underlying generator. It uses the Gemini-2.5-Flash model as a generic policy-violating concept detector, avoiding reliance on multiple category-specific detectors, and performs localized semantic editing to replace unsafe content. Prior post-hoc editing methods often rely on imprecise spatial localization, which undermines usability and limits deployability, particularly in multi-concept scenes. To address this limitation, ReVision introduces a VLM-assisted spatial gating mechanism that enforces instance-consistent localization, enabling precise edits while preserving scene integrity. We evaluate ReVision on a 245-image benchmark covering both single- and multi-concept scenarios. Results show that ReVision (i) improves CLIP-based alignment toward safe prompts by +$0.121$ on average; (ii) significantly improves multi-concept background fidelity (LPIPS $0.166 \rightarrow 0.058$); (iii) achieves near-complete suppression on category-specific detectors (e.g., NudeNet $70.51 \rightarrow 0$); and (iv) reduces policy-violating content recognizability in a human moderation study from $95.99\%$ to $10.16\%$.


Key Contributions

  • Training-free, post-hoc safety framework (ReVision) that edits policy-violating concepts in generated images without modifying the underlying generator
  • VLM-assisted spatial gating mechanism using Gemini-2.5-Flash for instance-consistent localization of unsafe content in multi-concept scenes
  • Evaluation benchmark of 245 images covering single- and multi-concept policy violation scenarios across categories including nudity, violence, and substance abuse
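The contributions above amount to a detect → gate → edit loop over generated images. A minimal sketch of that loop, in Python, with every name hypothetical: `detect`, `segment`, and `edit` stand in for the Gemini-2.5-Flash concept detector, an instance segmenter, and the localized semantic editor, respectively; this is an illustration of the pipeline shape, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    concept: str       # policy category the VLM reports, e.g. "nudity"
    box: tuple         # (x0, y0, x1, y1) region the VLM flags
    replacement: str   # safe concept to edit in, e.g. "swimsuit"

def gate_mask(instance_mask, box):
    """Instance-consistent spatial gating (hypothetical form): keep only
    the segmented instance's pixels that fall inside the VLM-reported box,
    so the edit stays on the flagged object and spares the rest of the scene."""
    x0, y0, x1, y1 = box
    return [[instance_mask[y][x] if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(len(instance_mask[0]))]
            for y in range(len(instance_mask))]

def revision_pipeline(image, detect, segment, edit):
    """Post-hoc loop: detect unsafe concepts, gate each instance mask,
    then apply a localized semantic edit per flagged region."""
    for det in detect(image):                       # VLM as generic detector
        mask = gate_mask(segment(image, det), det.box)
        image = edit(image, mask, det.replacement)  # replace unsafe concept
    return image
```

The gating step is where this sketch differs from coarser post-hoc editors: by intersecting the instance mask with the detector's box, edits are confined to one object even when several concepts share the scene, which is the failure mode the paper's multi-concept LPIPS numbers target.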

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses output integrity of generative AI models: detects policy-violating AI-generated content (nudity, violence, etc.) and performs localized semantic editing to replace unsafe outputs, acting as a last-line defense in the generation pipeline. The system is explicitly motivated by adversarial bypasses of upstream safety mechanisms that allow unsafe content to reach the final output.


Details

Domains
vision, generative
Model Types
diffusion, vlm
Threat Tags
inference_time, black_box
Datasets
custom 245-image benchmark (single- and multi-concept), NudeNet
Applications
text-to-image generation, image-to-image generation, content moderation pipeline