
Unveiling the Latent Directions of Reflection in Large Language Models

Fu-Chieh Chang 1,2, Yu-Ting Lee 2, Pei-Yuan Wu 2



Published on arXiv (arXiv:2508.16989)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Activation steering interventions confirm that LLM reflection can be controllably suppressed and enhanced, with suppression substantially easier than enhancement, revealing an asymmetric vulnerability exploitable in jailbreak attacks on Qwen2.5-3B and Gemma3-4B-IT

Reflection Activation Steering

Novel technique introduced


Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with different reflective intentions: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward a mechanistic understanding of reflective reasoning in LLMs.


Key Contributions

  • Methodology using activation steering vectors to characterize three reflection levels (no reflection, intrinsic, triggered) in LLM activations
  • Empirical finding that suppressing reflection is considerably easier than stimulating it, creating an asymmetric adversarial risk
  • Demonstration that reflective behavior can be directly controlled via activation interventions, with clear stratification across reflection levels on GSM8k-adv and Cruxeval-o-adv
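The paper does not include an implementation, but the steering-vector construction it describes can be sketched with a common recipe: take the difference of mean activations between two reflection levels at a chosen layer, then add a scaled copy of that direction to the hidden state at inference time (a positive coefficient to enhance reflection, a negative one to suppress it). The sketch below uses synthetic activations and illustrative function names; it is not the authors' code.

```python
import numpy as np

def steering_vector(acts_high, acts_low):
    """Difference-of-means steering direction between two reflection levels.

    acts_high / acts_low: (n_prompts, d_model) hidden states collected at one
    layer for, e.g., triggered-reflection vs. no-reflection instructions.
    """
    v = acts_high.mean(axis=0) - acts_low.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-norm direction

def steer(hidden, v, alpha):
    """Intervene on a hidden state: alpha > 0 enhances the behavior the
    direction encodes, alpha < 0 suppresses it."""
    return hidden + alpha * v

# Toy demo with synthetic activations (d_model = 8).
rng = np.random.default_rng(0)
base = rng.normal(size=(16, 8))
acts_no_reflect = base
acts_reflect = base + 0.5          # reflective prompts shifted along a direction
v = steering_vector(acts_reflect, acts_no_reflect)

h = rng.normal(size=8)             # one hidden state at inference time
h_suppressed = steer(h, v, alpha=-4.0)
# The projection onto the reflection direction drops by exactly |alpha|,
# since v is unit-norm.
print(h @ v - h_suppressed @ v)
```

In a real model this would be applied via a forward hook on a mid-layer residual stream; the paper's asymmetry finding suggests the negative-alpha (suppression) setting works at much smaller magnitudes than enhancement.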

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Datasets
GSM8k-adv, Cruxeval-o-adv
Applications
llm reasoning, jailbreak defense, chain-of-thought safety