SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection
Maithili Joshi, Palash Nandi, Tanmoy Chakraborty
Published on arXiv (2509.16060)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SABER achieves a 51% improvement over the strongest baseline (GCG) on the HarmBench test set while inducing only marginal perplexity shift on the validation set.
SABER (Safety Alignment Bypass via Extra Residuals)
Novel technique introduced
Large Language Models (LLMs) with safety-alignment training are powerful instruments with robust language comprehension capabilities. These models typically undergo meticulous alignment procedures involving human feedback to ensure that they accept safe inputs while rejecting harmful or unsafe ones. However, despite their massive scale and alignment efforts, LLMs remain vulnerable to jailbreak attacks, in which malicious users manipulate the model into producing harmful outputs that it was explicitly trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly embedded in the middle-to-late layers. Building on this insight, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), which connects two intermediate layers $s$ and $e$, with $s < e$, through a residual connection. Our approach achieves a 51% improvement over the best-performing baseline on the HarmBench test set. Furthermore, SABER induces only a marginal shift in perplexity when evaluated on the HarmBench validation set. The source code is publicly available at https://github.com/PalGitts/SABER.
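The core mechanism described in the abstract can be sketched with a toy residual stack: cache the hidden state after an earlier layer $s$ and add a scaled copy of it to the input of a later layer $e$. This is a minimal illustration, not the paper's implementation — the real method operates on transformer hidden states, and the layer indices, scaling factor `alpha`, and the `forward` helper here are all hypothetical simplifications.

```python
import numpy as np

def forward(layers, x, extra=None, alpha=0.0):
    """Run a toy residual stack of square weight matrices.

    If `extra = (s, e)` is given, the output of layer s is cached and
    added, scaled by `alpha`, to the input of layer e -- a simplified
    sketch of a SABER-style cross-layer residual connection.
    """
    saved = None
    for i, W in enumerate(layers):
        if extra is not None and i == extra[1] and saved is not None:
            x = x + alpha * saved        # extra residual: h_e <- h_e + alpha * h_s
        x = x + np.tanh(W @ x)           # ordinary per-layer residual update
        if extra is not None and i == extra[0]:
            saved = x.copy()             # cache hidden state at layer s
    return x

rng = np.random.default_rng(0)
layers = [0.1 * rng.standard_normal((4, 4)) for _ in range(6)]
x = rng.standard_normal(4)

baseline = forward(layers, x)                       # no extra connection
bypassed = forward(layers, x, extra=(1, 4), alpha=0.5)
```

With `alpha = 0` the extra connection is inert and the output matches the baseline; a nonzero `alpha` shifts the later hidden states, which is the lever the paper uses to route representations around the safety-relevant middle-to-late layers.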
Key Contributions
- Empirically identifies that LLM safety mechanisms are predominantly embedded in middle-to-late transformer layers through representational divergence analysis
- Introduces SABER, a novel white-box jailbreak that short-circuits safety alignment by adding a scaled residual connection between two selected intermediate layers s and e
- Achieves 51% improvement over GCG (best baseline) on HarmBench test set with minimal perplexity degradation across four LLMs