ReVision : A Post-Hoc, Vision-Based Technique for Replacing Unacceptable Concepts in Image Generation Pipeline
Gurjot Singh 1, Prabhjot Singh 2, Aashima Sharma 3, Maninder Singh 3, Ryan Ko 4
Published on arXiv
2602.19149
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Reduces policy-violating content recognizability from 95.99% to 10.16% in a human moderation study, with near-complete suppression on the NudeNet detector (70.51 → 0) and improved multi-concept background fidelity (LPIPS 0.166 → 0.058).
ReVision
Novel technique introduced
Image-generative models are widely deployed across industries. Recent studies show that they can be exploited to produce policy-violating content. Existing mitigation strategies primarily operate at the pre- or mid-generation stages through techniques such as prompt filtering and safety-aware training/fine-tuning. Prior work shows that these approaches can be bypassed and often degrade generative quality. In this work, we propose ReVision, a training-free, prompt-based, post-hoc safety framework for image-generation pipelines. ReVision acts as a last-line defense by analyzing generated images and selectively editing unsafe concepts without altering the underlying generator. It uses the Gemini-2.5-Flash model as a generic policy-violating concept detector, avoiding reliance on multiple category-specific detectors, and performs localized semantic editing to replace unsafe content. Prior post-hoc editing methods often rely on imprecise spatial localization, which undermines usability and limits deployability, particularly in multi-concept scenes. To address this limitation, ReVision introduces a VLM-assisted spatial gating mechanism that enforces instance-consistent localization, enabling precise edits while preserving scene integrity. We evaluate ReVision on a 245-image benchmark covering both single- and multi-concept scenarios. Results show that ReVision (i) improves CLIP-based alignment toward safe prompts by +$0.121$ on average; (ii) significantly improves multi-concept background fidelity (LPIPS $0.166 \rightarrow 0.058$); (iii) achieves near-complete suppression on category-specific detectors (e.g., NudeNet $70.51 \rightarrow 0$); and (iv) reduces policy-violating content recognizability in a human moderation study from $95.99\%$ to $10.16\%$.
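The detect-gate-edit flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the VLM detector and the inpainting model are stubbed out as inputs, and the function names and the coverage threshold are assumptions made for the example. The key idea shown is instance-consistent gating: an edit mask is built only from whole instance masks that overlap the VLM-flagged region, so the replacement never cuts an object in half and the background stays untouched.

```python
import numpy as np

COVERAGE_THRESHOLD = 0.5  # illustrative gating threshold, not from the paper


def coverage(instance: np.ndarray, region: np.ndarray) -> float:
    """Fraction of an instance mask covered by the flagged region."""
    total = instance.sum()
    if total == 0:
        return 0.0
    return float(np.logical_and(instance, region).sum()) / float(total)


def gate_instances(flagged_region: np.ndarray,
                   instance_masks: list) -> np.ndarray:
    """Instance-consistent localization: keep only whole instance masks
    that sufficiently overlap the VLM-flagged region."""
    gated = np.zeros_like(flagged_region, dtype=bool)
    for m in instance_masks:
        if coverage(m, flagged_region) >= COVERAGE_THRESHOLD:
            gated |= m
    return gated


def revise(image: np.ndarray, flagged_region: np.ndarray,
           instance_masks: list, replacement: np.ndarray) -> np.ndarray:
    """Post-hoc edit: composite safe replacement content only inside the
    gated mask, leaving the rest of the scene intact."""
    mask = gate_instances(flagged_region, instance_masks)
    out = image.copy()
    out[mask] = replacement[mask]  # 2D boolean mask over an (H, W, C) image
    return out
```

In the actual pipeline, `flagged_region` would come from the Gemini-2.5-Flash detector and `replacement` from a localized semantic-editing model; here both are plain arrays so the gating logic itself can be inspected.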
Key Contributions
- Training-free, post-hoc safety framework (ReVision) that edits policy-violating concepts in generated images without modifying the underlying generator
- VLM-assisted spatial gating mechanism using Gemini-2.5-Flash for instance-consistent localization of unsafe content in multi-concept scenes
- Evaluation benchmark of 245 images covering single- and multi-concept policy violation scenarios across categories including nudity, violence, and substance abuse
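The background-fidelity claim above (LPIPS 0.166 → 0.058) is measured by comparing edited images against the originals outside the edited region. A minimal sketch of that comparison, substituting a plain masked mean-squared error for the paper's LPIPS metric (which requires a pretrained perceptual network); the function name and metric choice are assumptions for illustration only:

```python
import numpy as np


def background_mse(original: np.ndarray, edited: np.ndarray,
                   edit_mask: np.ndarray) -> float:
    """Mean squared pixel error over the region OUTSIDE the edit mask.
    Lower is better: a score of 0 means the edit left every background
    pixel untouched. A crude stand-in for the perceptual LPIPS score."""
    keep = ~edit_mask  # background = everything the edit should not touch
    if not keep.any():
        return 0.0
    diff = original[keep].astype(np.float64) - edited[keep].astype(np.float64)
    return float(np.mean(diff ** 2))
```

A well-gated edit, like the instance-consistent localization ReVision targets, should drive this background error toward zero even when the masked region changes completely.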
🛡️ Threat Analysis
Directly addresses output integrity of generative AI models: detects policy-violating AI-generated content (nudity, violence, etc.) and performs localized semantic editing to replace unsafe outputs, acting as a last-line defense in the generation pipeline. The system is explicitly motivated by adversarial bypasses of upstream safety mechanisms that allow unsafe content to reach the final output.