
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu 1, Yi Xie 2, Shen Lin 3, Shiqian Zhao 4, Xiaofeng Chen 1



Published on arXiv

2603.05773

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

REA achieves state-of-the-art attack success rates by selectively disabling the refusal execution axis while leaving harmfulness recognition intact, demonstrating a causal 'Knowing without Acting' dissociation.

Refusal Erasure Attack (REA)

Novel technique introduced


Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis (v_H, "Knowing") and an Execution Axis (v_R, "Acting"). Our geometric analysis reveals a universal "Reflex-to-Dissociation" evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of "Knowing without Acting." Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.
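The abstract names Double-Difference Extraction as the method for isolating the Execution Axis v_R from the Recognition Axis v_H. The paper's exact construction is not given here, so the following is a minimal sketch under one plausible reading: contrast refused vs. complied hidden states on harmful prompts, then subtract the same contrast on ambiguous prompts, so that any signal shared by both conditions (such as harmfulness recognition) cancels. The function name, the four-condition setup, and the use of mean activations are all assumptions.

```python
import numpy as np

def axis_from_double_difference(h_refuse, h_comply, a_refuse, a_comply):
    """Sketch of a double-difference direction extraction.

    Each argument is an (n_prompts, d_model) array of hidden states at
    some layer: h_* come from harmful prompts, a_* from ambiguous ones,
    split by whether the model refused or complied.

    First difference:  refuse-vs-comply contrast within each condition.
    Second difference: harmful contrast minus ambiguous contrast, which
    cancels components common to both (e.g. the recognition signal),
    leaving an estimate of the execution axis v_R.
    """
    harmful_contrast = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
    ambiguous_contrast = a_refuse.mean(axis=0) - a_comply.mean(axis=0)
    v = harmful_contrast - ambiguous_contrast
    return v / np.linalg.norm(v)  # unit-norm direction
```

On synthetic activations where refusal adds a fixed vector on harmful prompts only, the double difference recovers exactly that vector; with real model activations one would average over many prompts per condition to suppress noise.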


Key Contributions

  • Disentangled Safety Hypothesis (DSH): safety computation operates on two geometrically distinct subspaces — a Recognition Axis (detecting harmfulness) and an Execution Axis (triggering refusal) — with a universal 'Reflex-to-Dissociation' layerwise evolution.
  • Double-Difference Extraction and Adaptive Causal Steering methods to isolate and manipulate each axis independently, validated by a causal double dissociation on AmbiguityBench.
  • Refusal Erasure Attack (REA) that surgically targets the Execution Axis to achieve SOTA jailbreak success, plus discovery of architectural divergence between Llama3.1's explicit semantic control and Qwen2.5's latent distributed control.
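The REA contribution above describes surgically disabling the Execution Axis while leaving recognition intact. A standard way to realize such an intervention is directional ablation of the residual stream, h' = h − (h·v̂_R)v̂_R, which zeroes the component along the refusal direction and leaves every orthogonal feature untouched. The sketch below assumes this mechanism; whether REA applies it at one layer or many, and with what scaling, is not specified here.

```python
import numpy as np

def erase_refusal(hidden, v_r):
    """Project the refusal-execution component out of hidden states.

    hidden: (n_tokens, d_model) residual-stream activations.
    v_r:    (d_model,) estimated execution-axis direction.

    Returns hidden states with the v_r component removed; components
    orthogonal to v_r (including any recognition signal) are preserved,
    matching the 'Knowing without Acting' dissociation.
    """
    v_hat = v_r / np.linalg.norm(v_r)
    # (hidden @ v_hat) gives each token's scalar projection onto v_hat;
    # the outer product rebuilds those components for subtraction.
    return hidden - np.outer(hidden @ v_hat, v_hat)
```

In a real attack this transform would be applied as a forward hook on the model's residual stream at inference time, consistent with the white_box / inference_time threat tags below.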

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
AmbiguityBench
Applications
llm safety alignment, jailbreak attacks