defense 2026

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed , Sabrina Sadiekh , Chirag Agarwal

0 citations

Published on arXiv: 2604.18756

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves up to 5x reduction in jailbreak success rate against GCG and BEAST attacks while reducing cross-model transferability

SAE-augmented inference

Novel technique introduced


Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
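To make the idea of inserting an SAE into the residual stream concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes a top-k activation and tied encoder/decoder weights, and it uses random weights as a stand-in for a pretrained SAE; the names `TopKSAE`, `l0`, and `sae_augmented_residual` are hypothetical. The only point is the data flow: a residual vector is encoded into a sparse code and replaced by its reconstruction, with model weights left untouched.

```python
import random


class TopKSAE:
    """Toy sparse autoencoder: top-k activation, tied weights (both assumptions)."""

    def __init__(self, d_model, d_sae, k, seed=0):
        rng = random.Random(seed)
        scale = d_model ** -0.5
        # Random stand-in for pretrained weights: one dictionary row per latent.
        self.dictionary = [[rng.gauss(0.0, scale) for _ in range(d_model)]
                           for _ in range(d_sae)]
        self.k = k

    def encode(self, h):
        # ReLU pre-activations, then keep only the k largest latents.
        acts = [max(0.0, sum(w * x for w, x in zip(row, h)))
                for row in self.dictionary]
        order = sorted(range(len(acts)), key=acts.__getitem__, reverse=True)
        keep = set(order[:self.k])
        return [a if i in keep else 0.0 for i, a in enumerate(acts)]

    def decode(self, z):
        # Weighted sum of the dictionary rows selected by the sparse code.
        out = [0.0] * len(self.dictionary[0])
        for z_i, row in zip(z, self.dictionary):
            if z_i:
                for j, w in enumerate(row):
                    out[j] += z_i * w
        return out


def l0(z):
    """Number of active (non-zero) latents in a sparse code."""
    return sum(1 for v in z if v != 0.0)


def sae_augmented_residual(h, sae):
    """Replace a residual-stream vector with its SAE reconstruction."""
    return sae.decode(sae.encode(h))


# Hypothetical residual vector at one token position of one layer.
sae = TopKSAE(d_model=8, d_sae=32, k=4)
h = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 0.1, 0.9]
z = sae.encode(h)
h_hat = sae_augmented_residual(h, sae)
```

In a real model the same replacement would be applied to every token position at the chosen layer, and both steps stay differentiable, so gradients are reshaped by the sparse bottleneck rather than blocked. Here `k` plays the role of the L0 budget varied in the ablations.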


Key Contributions

  • Demonstrates SAEs as inference-time jailbreak defense without weight modification or gradient blocking
  • Identifies monotonic relationship between L0 sparsity and attack success rate across four model families
  • Reveals a layer-dependent defense-utility tradeoff in which intermediate layers best balance robustness and clean performance
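The dose-response claim above can be checked mechanically once attack success rate (ASR) has been measured at several L0 settings. The sketch below shows one way to do that. The function names are hypothetical, and the numbers in `toy_sweep` are placeholders chosen only to exercise the code; they are not results from the paper and imply nothing about the direction of the reported relationship.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (outcomes: list of bools)."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


def monotonic_direction(points):
    """Classify how ASR changes with L0, given (l0, asr) pairs in any order."""
    ordered = [asr for _, asr in sorted(points)]
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    rising = all(d >= 0 for d in diffs)
    falling = all(d <= 0 for d in diffs)
    if rising and falling:
        return "constant"
    if rising:
        return "nondecreasing"
    if falling:
        return "nonincreasing"
    return "non-monotonic"


# Placeholder sweep, NOT data from the paper: (L0, ASR) pairs.
toy_sweep = [(64, 0.25), (16, 0.10), (256, 0.40)]
direction = monotonic_direction(toy_sweep)
```

The same check can be run per model family and per attack, which is how a claim of monotonicity across all four families would be verified.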

🛡️ Threat Analysis

Input Manipulation Attack

Defends against optimization-based adversarial attacks (GCG, BEAST) that use gradients to craft jailbreak inputs at inference time.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, inference_time
Datasets
Gemma, LLaMA, Mistral, Qwen
Applications
llm safety, jailbreak defense