defense 2026

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba ¹, Jacopo Cortellazzi ¹, Javier Abad ², Pau Rodriguez ¹, Xavier Suau ¹, Arno Blaas ¹

¹ Apple

² ETH Zurich

0 citations · 70 references · arXiv (Cornell University)

Published on arXiv

2602.12418

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

CC-Delta achieves comparable or better safety-utility tradeoffs than dense latent space baselines across four aligned LLMs and twelve jailbreak attacks, with particular gains against out-of-distribution attacks.

CC-Delta (Context-Conditioned Delta Steering)

Novel technique introduced

Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), a defense based on sparse autoencoders (SAEs) that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. Our method clearly outperforms dense mean-shift steering on all four models, with the largest gains against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest that off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
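The feature-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says "statistical testing", so the paired t-statistic, the threshold of 3.0, and the function names `paired_t_statistics` and `select_features` are assumptions made here for illustration. Each row stands for the SAE feature activations at one token of a harmful request, once without and once with jailbreak context.

```python
import math
import statistics


def paired_t_statistics(plain, jailbroken):
    """Per-feature t-statistic of the paired deltas (jailbroken - plain).

    plain, jailbroken: equal-length lists of rows; row i in each list holds
    SAE feature activations for the same harmful request, without and with
    jailbreak context. (Assumed data layout, for illustration only.)
    """
    n_features = len(plain[0])
    stats = []
    for j in range(n_features):
        deltas = [jb[j] - p[j] for p, jb in zip(plain, jailbroken)]
        mean = statistics.fmean(deltas)
        sd = statistics.stdev(deltas)
        if sd == 0.0:
            # Degenerate case: every pair shows exactly the same delta.
            stats.append(0.0 if mean == 0.0 else math.copysign(math.inf, mean))
        else:
            stats.append(mean / (sd / math.sqrt(len(deltas))))
    return stats


def select_features(plain, jailbroken, threshold=3.0):
    """Indices of features whose |t| exceeds a (hypothetical) threshold."""
    t_values = paired_t_statistics(plain, jailbroken)
    return [j for j, t in enumerate(t_values) if abs(t) > threshold]


# Toy data: feature 0 reacts strongly to the jailbreak context, the rest do not.
plain = [[0.0, 1.0, 0.5], [0.1, 0.9, 0.4], [0.0, 1.1, 0.6], [0.2, 1.0, 0.5]]
jailbroken = [[2.0, 1.0, 0.6], [2.3, 0.8, 0.3], [1.9, 1.2, 0.6], [2.2, 1.1, 0.4]]
selected = select_features(plain, jailbroken)
print(selected)
```

On the toy data only feature 0 shifts consistently when jailbreak context is added, so it is the only one selected. A real pipeline would likely apply a proper significance test with multiple-comparison correction across the many SAE features; that choice is not specified in this summary.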

Key Contributions

Context-Conditioned Delta Steering (CC-Delta): identifies jailbreak-relevant SAE features by contrasting token-level representations of harmful requests with vs. without jailbreak context using statistical testing.
Demonstrates that steering in sparse SAE feature space outperforms dense activation steering on all four aligned instruction-tuned models, especially against out-of-distribution jailbreak attacks.
Shows that off-the-shelf interpretability SAEs can be repurposed as practical jailbreak defenses without any task-specific retraining.
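To make "mean-shift steering in SAE latent space" concrete, here is a minimal sketch under stated assumptions. The ReLU encoder, the clamping of steered features at zero, the re-adding of the SAE reconstruction error, the strength parameter `alpha`, and all function names are illustrative choices made here, not details confirmed by the paper. `shift` maps a selected feature index to the mean activation delta observed when jailbreak context is present.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Assumed SAE encoder: ReLU(W_enc x + b_enc)."""
    return [max(0.0, h + b) for h, b in zip(matvec(w_enc, x), b_enc)]


def sae_decode(z, w_dec, b_dec):
    """Assumed SAE decoder: W_dec z + b_dec (one row per output dimension)."""
    return [h + b for h, b in zip(matvec(w_dec, z), b_dec)]


def steer(x, w_enc, b_enc, w_dec, b_dec, shift, alpha=1.0):
    """Shift selected SAE features against their jailbreak delta.

    shift: dict {feature_index: mean_delta}; alpha scales the intervention.
    """
    z = sae_encode(x, w_enc, b_enc)
    # Keep what the SAE fails to reconstruct so unrelated information survives.
    error = [xi - ri for xi, ri in zip(x, sae_decode(z, w_dec, b_dec))]
    # Subtract the mean delta on selected features only; keep activations >= 0.
    z_steered = [max(0.0, zi - alpha * shift.get(j, 0.0)) for j, zi in enumerate(z)]
    return [s + e for s, e in zip(sae_decode(z_steered, w_dec, b_dec), error)]


# Toy identity "SAE" so the effect is easy to read off.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
steered = steer([3.0, 1.0], identity, zeros, identity, zeros, shift={0: 2.0})
print(steered)
```

In this toy case only the first coordinate changes, because only feature 0 appears in `shift`. Touching a small set of selected features is the property that distinguishes sparse steering from adding a single dense vector to every activation.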

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, black_box

Datasets

AdvBench

Applications

llm safety, jailbreak defense, aligned instruction-tuned models

