SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection
Maithili Joshi, Palash Nandi, Tanmoy Chakraborty
Published on arXiv (2509.16060)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SABER achieves a 51% improvement over the strongest baseline (GCG) on the HarmBench test set while inducing only marginal perplexity shift on the validation set.
SABER (Safety Alignment Bypass via Extra Residuals)
Novel technique introduced
Large Language Models (LLMs) with safety-alignment training are powerful instruments with robust language comprehension capabilities. These models typically undergo meticulous alignment procedures involving human feedback to ensure that they accept safe inputs while rejecting harmful or unsafe ones. However, despite their massive scale and alignment efforts, LLMs remain vulnerable to jailbreak attacks, in which malicious users manipulate the model into producing harmful outputs that it was explicitly trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly embedded in the middle-to-late layers. Building on this insight, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), which connects two intermediate layers $s$ and $e$, with $s < e$, through a residual connection. Our approach achieves a 51% improvement over the best-performing baseline on the HarmBench test set. Furthermore, SABER induces only a marginal shift in perplexity when evaluated on the HarmBench validation set. The source code is publicly available at https://github.com/PalGitts/SABER.
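The core mechanism described in the abstract can be sketched with a toy residual stack: cache the hidden state after an earlier layer $s$ and add a scaled copy of it to the input of a later layer $e$. This is a minimal illustration, not the paper's implementation — the real method operates on transformer hidden states, and the layer indices, scaling factor `alpha`, and the `forward` helper here are all hypothetical simplifications.

```python
import numpy as np

def forward(layers, x, extra=None, alpha=0.0):
    """Run a toy residual stack of square weight matrices.

    If `extra = (s, e)` is given, the output of layer s is cached and
    added, scaled by `alpha`, to the input of layer e -- a simplified
    sketch of a SABER-style cross-layer residual connection.
    """
    saved = None
    for i, W in enumerate(layers):
        if extra is not None and i == extra[1] and saved is not None:
            x = x + alpha * saved        # extra residual: h_e <- h_e + alpha * h_s
        x = x + np.tanh(W @ x)           # ordinary per-layer residual update
        if extra is not None and i == extra[0]:
            saved = x.copy()             # cache hidden state at layer s
    return x

rng = np.random.default_rng(0)
layers = [0.1 * rng.standard_normal((4, 4)) for _ in range(6)]
x = rng.standard_normal(4)

baseline = forward(layers, x)                       # no extra connection
bypassed = forward(layers, x, extra=(1, 4), alpha=0.5)
```

With `alpha = 0` the extra connection is inert and the output matches the baseline; a nonzero `alpha` shifts the later hidden states, which is the lever the paper uses to route representations around the safety-relevant middle-to-late layers.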
Key Contributions
- Empirically identifies that LLM safety mechanisms are predominantly embedded in middle-to-late transformer layers through representational divergence analysis
- Introduces SABER, a novel white-box jailbreak that short-circuits safety alignment by adding a scaled residual connection between two selected intermediate layers s and e
- Achieves 51% improvement over GCG (best baseline) on HarmBench test set with minimal perplexity degradation across four LLMs