defense 2025

Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Hoang Phan , Victor Li , Qi Lei

New York University

1 citation · 62 references · EMNLP

Published on arXiv

2510.01270

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PSR reduces attack success rate from 77.5% to 5.9% on Llama-3.1-8B-Instruct and from 89.7% to 5.6% on Llama-3.1-8B base without additional training, while preserving benign task performance

Progressive Self-Reflection (PSR)

Novel technique introduced

Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that our proposed method reduces the attack success rate from 77.5% to 5.9% on Llama-3.1-8B-Instruct, from 89.7% to 5.6% on Llama-3.1-8B base, and from 44.4% to 3.8% on Qwen2.5-7B-Instruct, without additional training, while maintaining each model's original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input's risk profile.
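
The adaptive mechanism described above can be illustrated with a minimal sketch. It assumes the predictor is a small MLP that maps a prompt feature vector to one of a few candidate round counts. The feature vector, the candidate set ROUND_OPTIONS, the function names, and all weights are hypothetical and hand-picked for illustration; none of them come from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: Vector) -> List[float]:
    return [max(0.0, a) for a in v]


def predict_rounds(features: Vector, w1: Matrix, b1: Vector,
                   w2: Matrix, b2: Vector,
                   round_options: Sequence[int]) -> int:
    """Pick the reflection-round count whose logit is largest."""
    hidden = relu(linear(features, w1, b1))
    logits = linear(hidden, w2, b2)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return round_options[best]


# Hypothetical setup: 2 input features, 2 hidden units, 3 candidate budgets.
ROUND_OPTIONS = (0, 2, 4)
W1 = [[1.0, 0.0], [0.0, 1.0]]
B1 = [0.0, 0.0]
W2 = [[-1.0, -1.0], [1.0, 0.0], [2.0, 1.0]]
B2 = [0.5, 0.0, -1.0]

low_risk = predict_rounds([0.1, 0.1], W1, B1, W2, B2, ROUND_OPTIONS)
high_risk = predict_rounds([0.9, 0.9], W1, B1, W2, B2, ROUND_OPTIONS)
```

With these toy weights, a low-scoring input is assigned no reflection rounds and a high-scoring input is assigned the largest budget, which is the trade-off the abstract describes: spend inference compute only where the input looks risky.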

Key Contributions

Progressive Self-Reflection (PSR): a training-free inference-time technique that interleaves generation with periodic self-assessment checkpoints using a binary harmful/harmless classifier on internal activations to trigger backtracking when harmful content is detected
Lightweight MLP predictor that estimates the optimal number of reflection rounds per input based on complexity, avoiding unnecessary overhead on benign inputs
Demonstrated significant ASR reductions across multiple LLMs (e.g., 77.5%→5.9% on Llama-3.1-8B-Instruct, 89.7%→5.6% on Llama-3.1-8B base) without any additional model training
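
The checkpoint-and-backtrack idea in the first contribution can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the authors' implementation: generate_chunk and is_harmful are hypothetical callables standing in for the LLM decoder and the harmful/harmless check, backtracking is modeled as discarding the draft and returning a refusal, and max_rounds stands in for the reflection budget.

```python
from typing import Callable, List


def progressive_self_reflection(
    generate_chunk: Callable[[List[str]], List[str]],
    is_harmful: Callable[[List[str]], bool],
    max_rounds: int,
    max_tokens: int = 32,
    refusal: str = "I can't help with that.",
) -> str:
    """Interleave generation with up to max_rounds safety checkpoints."""
    tokens: List[str] = []
    rounds_used = 0
    while len(tokens) < max_tokens:
        chunk = generate_chunk(tokens)
        if not chunk:  # the generator has nothing more to say
            break
        tokens.extend(chunk)
        if rounds_used < max_rounds:
            rounds_used += 1
            if is_harmful(tokens):
                # Assumed backtracking policy: drop the draft and refuse.
                return refusal
    return " ".join(tokens[:max_tokens])


def make_scripted_generator(script: List[str], chunk_size: int = 2):
    """Toy stand-in for an LLM: replays a fixed token script in chunks."""
    def generate_chunk(prefix: List[str]) -> List[str]:
        start = len(prefix)
        return script[start:start + chunk_size]
    return generate_chunk


def keyword_check(tokens: List[str]) -> bool:
    """Toy stand-in for the harmful/harmless classifier."""
    return "explosive" in tokens


benign = make_scripted_generator(["here", "is", "a", "cake", "recipe"])
risky = make_scripted_generator(["mix", "the", "explosive", "powder"])

benign_out = progressive_self_reflection(benign, keyword_check, max_rounds=2)
one_round = progressive_self_reflection(risky, keyword_check, max_rounds=1)
two_rounds = progressive_self_reflection(risky, keyword_check, max_rounds=2)
```

In this toy run, a single checkpoint fires before the harmful token appears and the draft slips through, while a second checkpoint catches it. That mirrors the test-time scaling claim that more reflection rounds buy more safety at the cost of extra inference work.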

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, black_box

Datasets

CodeChameleon

Applications

large language model safety, jailbreak defense, harmful content prevention

Similar Papers

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

Defend LLMs Through Self-Consciousness

Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection

Securing AI Agents Against Prompt Injection Attacks

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features

SecInfer: Preventing Prompt Injection via Inference-time Scaling