Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
Nirmalendu Prakash 1, Yeo Wei Jie 2, Amir Abdullah 3, Ranjan Satapathy 4, Erik Cambria 2, Roy Ka Wei Lee 1
Published on arXiv: 2509.09708
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Ablating a minimal set of sparse autoencoder (SAE) latent features, identified via greedy filtering and factorization machines, flips Gemma-2-2B-IT and LLaMA-3.1-8B-IT from refusal to compliance. The search also finds evidence of redundant refusal features that stay dormant until earlier features are suppressed, acting as a backup circuit.
Novel technique introduced: SAE Refusal Feature Ablation Pipeline
Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
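Stage (1) of the search can be illustrated with a minimal sketch. It assumes the refusal direction is estimated as the difference between mean residual-stream activations on harmful and harmless prompts, and that "near" means high cosine similarity between an SAE decoder row and that direction. The abstract specifies neither choice, and the function names and toy arrays below are hypothetical stand-ins for real model activations and SAE weights.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean harmless to the mean harmful activation.

    Assumption: difference of means; the source does not state the estimator.
    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def features_near_direction(decoder: np.ndarray, direction: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k SAE features whose decoder rows align best with `direction`.

    `decoder` has shape (n_features, d_model); alignment is cosine similarity.
    """
    norms = np.linalg.norm(decoder, axis=1) * np.linalg.norm(direction)
    cos = decoder @ direction / norms
    return np.argsort(-cos)[:k]

# Toy data standing in for real activations and SAE weights (hypothetical).
rng = np.random.default_rng(0)
harmless = rng.normal(size=(50, 8))
harmful = harmless.copy()
harmful[:, 0] += 3.0           # harmful prompts are shifted along axis 0
decoder = np.eye(8)[::-1]      # feature 7 decodes to axis 0

direction = refusal_direction(harmful, harmless)
candidates = features_near_direction(decoder, direction, k=3)
```

In this toy setup the recovered direction is axis 0, so the feature whose decoder row lies on that axis ranks first. With real models the candidate set would then be passed on to the filtering stage.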
Key Contributions
- Three-stage pipeline (Refusal Direction → Greedy Filtering → Interaction Discovery via factorization machines) that identifies minimal sets of SAE latent features causally responsible for refusal
- Demonstration that ablating these jailbreak-critical features reliably flips Gemma-2-2B-IT and LLaMA-3.1-8B-IT from refusal to compliance on harmful prompts
- Evidence of redundant refusal features that remain dormant until earlier features are suppressed, suggesting a backup safety circuit
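The Greedy Filtering stage can be sketched as a single backward-elimination pass. This is an assumption about how the pruning might work, not the authors' stated procedure: starting from the full candidate set, each feature is dropped if the model still complies without ablating it. The `still_jailbreaks` callback is hypothetical; in practice it would run the model with the listed SAE features ablated and judge whether the output is a refusal. Here a toy oracle stands in for it.

```python
from typing import Callable, List, Sequence

def greedy_prune(candidates: Sequence[int],
                 still_jailbreaks: Callable[[List[int]], bool]) -> List[int]:
    """Drop features one at a time while ablating the rest still flips the model.

    Assumption: one backward-elimination pass in the given order. The result
    is minimal in the sense that no single remaining feature can be removed,
    but it is not guaranteed to be the smallest possible set.
    """
    kept = list(candidates)
    if not still_jailbreaks(kept):
        raise ValueError("ablating the full candidate set does not flip the model")
    for feature in list(kept):
        trial = [f for f in kept if f != feature]
        if still_jailbreaks(trial):
            kept = trial  # this feature was not needed for the flip
    return kept

# Toy oracle (hypothetical): the flip succeeds only if features 2 and 5 are both ablated.
REQUIRED = {2, 5}

def toy_oracle(ablated: List[int]) -> bool:
    return REQUIRED.issubset(ablated)

minimal = greedy_prune([1, 2, 3, 5, 8], toy_oracle)
```

Each oracle call corresponds to one ablated forward pass, so this pass costs one model run per candidate feature plus one initial check.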