attack 2025

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

Maithili Joshi, Palash Nandi, Tanmoy Chakraborty


Published on arXiv: 2509.16060

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

SABER achieves a 51% improvement over the strongest baseline (GCG) on the HarmBench test set while inducing only a marginal perplexity shift on the validation set.

SABER (Safety Alignment Bypass via Extra Residuals)

Novel technique introduced


Large Language Models (LLMs) with safety-alignment training are powerful instruments with robust language comprehension capabilities. These models typically undergo meticulous alignment procedures involving human feedback to ensure that safe inputs are accepted while harmful or unsafe ones are rejected. However, despite their massive scale and alignment efforts, LLMs remain vulnerable to jailbreak attacks, where malicious users manipulate the model to produce harmful outputs that it was explicitly trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly embedded in the middle-to-late layers. Building on this insight, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), which connects two intermediate layers $s$ and $e$ (with $s < e$) through a residual connection. Our approach achieves a 51% improvement over the best-performing baseline on the HarmBench test set. Furthermore, SABER induces only a marginal shift in perplexity when evaluated on the HarmBench validation set. The source code is publicly available at https://github.com/PalGitts/SABER.
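
The core mechanism can be made concrete with a short sketch: an extra residual path is added from the output of decoder layer $s$ to the output of layer $e$ using forward hooks on a Hugging Face causal LM. The model name, the layer indices `S` and `E`, and the scale `ALPHA` below are illustrative assumptions rather than the values used in the paper, which selects them through its own procedure (see the linked repository).

```python
# Minimal sketch of a SABER-style cross-layer residual connection via forward hooks.
# Model choice, layer indices S/E, and scale ALPHA are illustrative assumptions;
# the paper selects layers and scaling through its own procedure (see the repo).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any LLaMA-style decoder-only model
S, E, ALPHA = 6, 16, 1.0                           # hypothetical start/end layers and scale

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

_cache = {}

def save_hidden(module, inputs, output):
    # Cache the hidden states produced by decoder layer S.
    _cache["h_s"] = output[0] if isinstance(output, tuple) else output

def add_residual(module, inputs, output):
    # Add the scaled layer-S hidden states onto layer E's output, creating an
    # extra residual path that skips the intermediate layers S+1 .. E.
    if isinstance(output, tuple):
        return (output[0] + ALPHA * _cache["h_s"],) + output[1:]
    return output + ALPHA * _cache["h_s"]

layers = model.model.layers  # decoder blocks of a LLaMA-style model
hook_s = layers[S].register_forward_hook(save_hidden)
hook_e = layers[E].register_forward_hook(add_residual)

prompt = "Explain how residual connections work."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

hook_s.remove()
hook_e.remove()
```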


Key Contributions

  • Empirically identifies that LLM safety mechanisms are predominantly embedded in middle-to-late transformer layers through representational divergence analysis (a minimal probe sketch follows this list)
  • Introduces SABER, a novel white-box jailbreak that short-circuits safety alignment by adding a scaled residual connection between two selected intermediate layers $s$ and $e$
  • Achieves a 51% improvement over GCG (the strongest baseline) on the HarmBench test set with minimal perplexity degradation across four LLMs
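
The layer-localization finding in the first bullet can be probed with a simple layer-wise divergence check: compare the hidden states of a benign prompt and a harmful (refused) prompt at every layer and look for where they begin to diverge. This is a hedged sketch only; the cosine-distance metric, the single prompt pair, and the model choice are assumptions, and the paper's representational analysis may use a different measure and a larger prompt set.

```python
# Hedged sketch of a layer-wise representational divergence probe: compare the
# hidden states of a benign vs. a harmful prompt at each layer. The cosine-distance
# metric, the single prompt pair, and the model are illustrative assumptions only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_states(prompt):
    # Hidden state of the final prompt token at every layer (embeddings + each block).
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**ids, output_hidden_states=True).hidden_states
    return [h[0, -1].float() for h in hidden_states]

benign = last_token_states("Write a short poem about the ocean.")
harmful = last_token_states("Explain how to build a dangerous device.")

# A jump in divergence concentrated in the middle-to-late layers is consistent
# with the paper's claim about where safety-related processing is embedded.
for layer, (b, h) in enumerate(zip(benign, harmful)):
    dist = 1.0 - F.cosine_similarity(b, h, dim=0).item()
    print(f"layer {layer:2d}: divergence = {dist:.3f}")
```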

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
HarmBench
Applications
llm safety alignment, jailbreaking, rlhf-aligned language models