Defense · 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang 1,2, Andrew Estornell 1, David D. Baek 3, Bo Li 2,4, Xiaojun Xu 1

0 citations · 54 references · arXiv


Published on arXiv · 2510.18081

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ADA reduces the average attack success rate of GCG, AutoDAN, PAIR, and TAP to below 3% and achieves near-100% refusal against adversarial prefill attacks across Llama, Gemma, Mistral, Qwen, DeepSeek, and GPT model families.

Any-Depth Alignment (ADA)

Novel technique introduced


Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (whether induced by adversarial prompt attacks or by harmful assistant-prefill attacks). This raises a fundamental question: can the innate shallow alignment of LLMs be unlocked to ensure safety at arbitrary generation depths? To this end, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA builds on the observation that alignment concentrates in the assistant header tokens through their repeated use in shallow-refusal training, so these tokens carry the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens, and it reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%, all while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning, whether benign or adversarial.
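The mechanism described above — reintroducing assistant header tokens mid-stream so the model reassesses its own partial output — can be sketched with a toy stand-in for an LLM. This is a minimal illustration of the idea only, not the authors' implementation; the `HEADER` string, `interval` parameter, and `toy_model_step` stub are all illustrative assumptions.

```python
HEADER = "<|assistant|>"  # chat-template header tokens carrying the alignment prior
REFUSAL = "I can't help with that."

def toy_model_step(context: str) -> str:
    # Stand-in for one LLM decoding step. When the context ends with the
    # assistant header, the (shallowly aligned) model applies its refusal
    # prior; otherwise it simply continues whatever is underway.
    if context.endswith(HEADER):
        return REFUSAL if "explosive" in context else "OK"
    return " token"

def ada_generate(prompt: str, prefill: str, max_new: int = 12, interval: int = 4) -> str:
    # Generate from a (possibly adversarial) assistant prefill; every
    # `interval` steps, re-append the header so the model reassesses the
    # partial response and can recover a refusal at any generation depth.
    out = prefill
    for step in range(1, max_new + 1):
        if step % interval == 0:
            probe = prompt + out + HEADER  # mid-stream realignment probe
            if toy_model_step(probe) == REFUSAL:
                return REFUSAL  # refusal recovered despite the harmful prefill
        out += toy_model_step(prompt + out)
    return out
```

With a real model, the probe would be a forward pass on the token IDs of the chat template's assistant header; here the stub keys on a single harmful keyword so the control flow is easy to follow.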


Key Contributions

  • Identifies that LLM alignment is concentrated in assistant header tokens through shallow-refusal training, and that reintroducing these tokens mid-stream can recover safety refusals at any generation depth
  • Proposes Any-Depth Alignment (ADA), an inference-time defense with negligible overhead that requires no changes to model parameters
  • Achieves near-100% refusal against adversarial prefill attacks and reduces average success rate of GCG, AutoDAN, PAIR, and TAP to below 3% across six major model families

🛡️ Threat Analysis

Input Manipulation Attack

Defends against adversarial prompt attacks, including gradient-based ones such as GCG (which optimizes token-level perturbations into adversarial suffixes), reducing their average success rate to below 3%.
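The complementary threat ADA targets, the assistant-prefill attack, exploits exactly the shallowness described in the abstract: the refusal prior fires only at depth zero of the assistant turn. A toy stub makes the failure mode concrete; the function and its keyword check are illustrative assumptions, not any real model's behavior.

```python
REFUSAL = "I can't help with that."

def shallow_aligned_model(user_msg: str, assistant_prefill: str = "") -> str:
    # Toy model with shallow alignment: the harmfulness check runs only
    # when the assistant turn starts empty.
    harmful = "explosive" in user_msg
    if harmful and assistant_prefill == "":
        return REFUSAL  # refusal prior fires at the start of the turn
    # Once a compliant continuation is seeded, the model keeps completing it,
    # which is why prefill attacks bypass shallow alignment.
    return assistant_prefill + " ...continues the requested content..."
```

The same harmful request is refused when the assistant turn starts fresh but completed when an attacker seeds it with a compliant opening like "Sure, here is how:"; ADA's mid-stream header reinsertion is designed to close this gap.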


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · black_box · inference_time
Datasets
AdvBench
Applications
llm safety · chatbot safety · instruction-following models