attack 2025

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

0 citations

Published on arXiv

2509.09708

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Ablating a minimal set of SAE latent features, identified via greedy filtering and factorization machines, causally flips instruction-tuned LLMs from refusal to compliance. It also exposes redundant safety features that stay dormant until earlier features are suppressed and then act as a backup circuit.
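
For readers unfamiliar with factorization machines, the sketch below shows the standard second-order FM score, which adds a learned pairwise interaction term for every pair of input features. It is an illustration only: treating `x` as a binary vector marking which SAE features are active is an assumption made here, and the names `fm_predict`, `fm_predict_naive`, `w0`, `w`, and `V` are hypothetical rather than taken from the paper.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score using the O(k*n) reformulation of the pairwise term."""
    xv = x @ V  # shape (k,): per-factor weighted sum of the inputs
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return w0 + w @ x + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same score with the explicit sum over pairs i < j, for checking."""
    n = len(x)
    score = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score

# Synthetic inputs: x marks which features are active (an assumption of this sketch).
rng = np.random.default_rng(0)
n_features, k = 6, 3
x = rng.integers(0, 2, size=n_features).astype(float)
w0 = 0.1
w = rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
```

The two functions compute the same value; the first avoids the quadratic loop over feature pairs, which matters when the number of candidate features is large.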

SAE Refusal Feature Ablation Pipeline

Novel technique introduced

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
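
Stage (1) can be pictured with a small synthetic sketch. The abstract does not say how the refusal direction is estimated; the code below assumes a difference of mean activations between harmful and harmless prompts, a common choice in related work, and ranks SAE decoder rows by cosine similarity to that direction. All arrays are random stand-ins, and the names `refusal_direction` and `features_near_direction` are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (an assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def features_near_direction(decoder, direction, k):
    """Indices of the k decoder rows most cosine-similar to the direction."""
    norms = np.linalg.norm(decoder, axis=1)
    cos = decoder @ direction / (norms * np.linalg.norm(direction))
    return np.argsort(-cos)[:k]

# Synthetic stand-ins for residual-stream activations and an SAE decoder.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
true_dir = np.zeros(d_model)
true_dir[0] = 1.0

harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

decoder = rng.normal(size=(n_features, d_model))
decoder[7] = 3.0 * true_dir  # plant one feature aligned with the direction

direction = refusal_direction(harmful, harmless)
candidates = features_near_direction(decoder, direction, k=5)
```

In this toy setup the planted feature ranks first among the candidates; with a real SAE the candidate set would then be handed to the pruning stage.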

Key Contributions

Three-stage pipeline (Refusal Direction → Greedy Filtering → Interaction Discovery via factorization machines) that identifies minimal sets of SAE latent features causally responsible for refusal
Demonstration that ablating these jailbreak-critical features reliably flips Gemma-2-2B-IT and LLaMA-3.1-8B-IT from refusal to compliance on harmful prompts
Discovery of redundant refusal features that remain dormant until earlier features are suppressed, revealing a backup safety circuit
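
The greedy filtering step can be sketched as backward elimination over the candidate features. This is an assumed procedure, since the summary only says the set is pruned to a minimal one. The predicate `flips` is hypothetical: in practice it would run the model with the given features ablated and report whether the output changed from refusal to compliance. Here it is replaced by a toy rule.

```python
from typing import Callable, FrozenSet, Iterable, List

def greedy_prune(
    candidates: Iterable[int],
    flips: Callable[[FrozenSet[int]], bool],
) -> List[int]:
    """Drop features one at a time as long as the remaining set still flips."""
    kept = list(candidates)
    if not flips(frozenset(kept)):
        raise ValueError("ablating the full candidate set does not flip the outcome")
    for feature in list(kept):
        trial = [g for g in kept if g != feature]
        if flips(frozenset(trial)):
            kept = trial  # this feature was not needed
    return kept

def toy_flips(ablated: FrozenSet[int]) -> bool:
    # Hypothetical stand-in: the flip requires ablating both feature 3 and feature 7.
    return {3, 7} <= ablated

minimal = greedy_prune([1, 3, 5, 7, 9], toy_flips)
```

The result is minimal only in the sense that no single remaining feature can be removed without losing the flip; a single greedy pass does not guarantee the smallest possible set.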

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

white_box, inference_time

Applications

llm safety alignment, instruction-tuned language models

Similar Papers

On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection