Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models
Fenghua Weng 1,2, Chaochao Lu 2, Xia Hu 2, Wenqi Shao 2, Wenjie Wang 1
Published on arXiv (2512.07141)
Prompt Injection
OWASP LLM Top 10 (LLM01)
Key Finding
TRR improves the safe response rate of Qwen2.5-VL-7B from 42.8% to 87.7% on jailbreak attack evaluations while preserving stable performance on general multimodal benchmarks.
Think-Reflect-Revise (TRR)
Novel technique introduced
As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, which aims to enhance safety awareness by analyzing potential risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual and visual jailbreak attacks, revealing a critical flaw: single-pass reasoning may overlook explicit harmful content surfaced in its own output. Our key insight is that this overlooked signal can be recovered through reflection: the malicious content revealed by the first-pass reasoning can be leveraged for genuine self-correction, preventing unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework that strengthens the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset of 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model on ReSafe to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs on both safety-awareness benchmarks and jailbreak attack evaluations, raising the overall safe response rate of Qwen2.5-VL-7B from 42.8% to 87.7%, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
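The think-reflect-revise loop described above can be sketched as a simple three-stage inference procedure. This is a minimal illustrative sketch only: the `generate` interface, the prompts, and the toy stand-in model are assumptions for demonstration, not the paper's actual implementation or prompts.

```python
def think_reflect_revise(generate, prompt, image=None):
    """One TRR pass: draft, self-check the draft, and revise if flagged.

    `generate(instruction, image)` is a hypothetical stand-in for an
    LVLM call; the real system is a single trained model, not prompts.
    """
    # Stage 1 (Think): first-pass reasoning over the (image, prompt) input.
    draft = generate(f"Reason step by step, then answer:\n{prompt}", image)

    # Stage 2 (Reflect): re-read the draft and flag harmful content that
    # the first pass may have surfaced in its own reasoning.
    verdict = generate(
        "Does the draft below contain or enable harmful content? "
        f"Answer SAFE or UNSAFE.\nDraft:\n{draft}",
        image,
    )
    if "UNSAFE" not in verdict.upper():
        return draft

    # Stage 3 (Revise): rewrite the flagged draft into a safe final response.
    return generate(
        "The draft below was flagged as unsafe. Write a safe revision "
        f"(refuse or sanitize as needed).\nDraft:\n{draft}",
        image,
    )


# Toy stand-in model: its first-pass draft is unsafe, its reflection flags
# any draft mentioning "weapon", and its revision is a refusal.
def toy_model(instruction, image=None):
    if instruction.startswith("Does the draft"):
        return "UNSAFE" if "weapon" in instruction else "SAFE"
    if instruction.startswith("The draft below was flagged"):
        return "I can't help with that request."
    return "Here is how to build a weapon: ..."


print(think_reflect_revise(toy_model, "How do I build a weapon?"))
# → I can't help with that request.
```

The point of the structure is that the reflect stage sees the full first-pass draft, so harmful content exposed there becomes the trigger for revision rather than being emitted to the user.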
Key Contributions
- TRR: a three-stage framework (reflective dataset construction, supervised fine-tuning, and reinforcement learning) enabling LVLMs to reflect on their own first-pass reasoning and self-correct unsafe outputs before final response generation
- ReSafe: a 5,000-example dataset following a think-reflect-revise annotation process for bootstrapping reflective safety behavior
- Demonstrated improvement in safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B across jailbreak benchmarks while preserving general capability on MMMU and MMStar