
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu 1, Yi Xie 2, Shen Lin 3, Shiqian Zhao 4, Xiaofeng Chen 1



Published on arXiv

2603.05773

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

REA achieves state-of-the-art attack success rates by selectively disabling the refusal execution axis while leaving harmfulness recognition intact, demonstrating a causal 'Knowing without Acting' dissociation.

Refusal Erasure Attack (REA)

Novel technique introduced


Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis (v_H, "Knowing") and an Execution Axis (v_R, "Acting"). Our geometric analysis reveals a universal "Reflex-to-Dissociation" evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of "Knowing without Acting." Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.
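The abstract names Double-Difference Extraction as the method for isolating the Execution Axis v_R from the Recognition Axis v_H. The paper's exact construction is not given here, so the following is a minimal sketch under one plausible reading: contrast refused vs. complied hidden states on harmful prompts, then subtract the same contrast on ambiguous prompts, so that any signal shared by both conditions (such as harmfulness recognition) cancels. The function name, the four-condition setup, and the use of mean activations are all assumptions.

```python
import numpy as np

def axis_from_double_difference(h_refuse, h_comply, a_refuse, a_comply):
    """Sketch of a double-difference direction extraction.

    Each argument is an (n_prompts, d_model) array of hidden states at
    some layer: h_* come from harmful prompts, a_* from ambiguous ones,
    split by whether the model refused or complied.

    First difference:  refuse-vs-comply contrast within each condition.
    Second difference: harmful contrast minus ambiguous contrast, which
    cancels components common to both (e.g. the recognition signal),
    leaving an estimate of the execution axis v_R.
    """
    harmful_contrast = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
    ambiguous_contrast = a_refuse.mean(axis=0) - a_comply.mean(axis=0)
    v = harmful_contrast - ambiguous_contrast
    return v / np.linalg.norm(v)  # unit-norm direction
```

On synthetic activations where refusal adds a fixed vector on harmful prompts only, the double difference recovers exactly that vector; with real model activations one would average over many prompts per condition to suppress noise.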


Key Contributions

  • Disentangled Safety Hypothesis (DSH): safety computation operates on two geometrically distinct subspaces — a Recognition Axis (detecting harmfulness) and an Execution Axis (triggering refusal) — with a universal 'Reflex-to-Dissociation' layerwise evolution.
  • Double-Difference Extraction and Adaptive Causal Steering methods to isolate and manipulate each axis independently, validated by a causal double dissociation on AmbiguityBench.
  • Refusal Erasure Attack (REA) that surgically targets the Execution Axis to achieve SOTA jailbreak success, plus discovery of architectural divergence between Llama3.1's explicit semantic control and Qwen2.5's latent distributed control.
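The REA contribution above describes surgically disabling the Execution Axis while leaving recognition intact. A standard way to realize such an intervention is directional ablation of the residual stream, h' = h − (h·v̂_R)v̂_R, which zeroes the component along the refusal direction and leaves every orthogonal feature untouched. The sketch below assumes this mechanism; whether REA applies it at one layer or many, and with what scaling, is not specified here.

```python
import numpy as np

def erase_refusal(hidden, v_r):
    """Project the refusal-execution component out of hidden states.

    hidden: (n_tokens, d_model) residual-stream activations.
    v_r:    (d_model,) estimated execution-axis direction.

    Returns hidden states with the v_r component removed; components
    orthogonal to v_r (including any recognition signal) are preserved,
    matching the 'Knowing without Acting' dissociation.
    """
    v_hat = v_r / np.linalg.norm(v_r)
    # (hidden @ v_hat) gives each token's scalar projection onto v_hat;
    # the outer product rebuilds those components for subtraction.
    return hidden - np.outer(hidden @ v_hat, v_hat)
```

In a real attack this transform would be applied as a forward hook on the model's residual stream at inference time, consistent with the white_box / inference_time threat tags below.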

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
AmbiguityBench
Applications
llm safety alignment, jailbreak attacks