Defense · 2025

Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach 1, Dung Nguyen 1, Thao Minh Le 2, Truyen Tran 1

0 citations · 43 references · arXiv


Published on arXiv · 2511.12155

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Targeted completion achieves 48–98% reductions in adversarial attack success rates across Llama and Qwen model families while preserving general capabilities

Targeted Completion with Base-Favored Tokens

Novel technique introduced


Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning: safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens -- vocabulary elements where base models assign higher probability than aligned models -- as computational indicators of incomplete safety learning, and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48--98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding of and practical solutions for fundamental limitations in safety alignment methodologies.
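The adaptive-penalty idea in the abstract can be illustrated with a minimal sketch: penalize probability mass the aligned model still places on base-favored tokens, weighted more heavily at later response positions where gradient signal has decayed. The function name, the linear weighting scheme, and the `decay` parameter are illustrative assumptions, not the paper's implementation.

```python
import math

def adaptive_penalty(aligned_logprobs, base_favored_mask, position, decay=0.1):
    """Toy position-weighted penalty on base-favored tokens.

    aligned_logprobs: log-probabilities the aligned model assigns over the vocab
    base_favored_mask: True where the base model out-probabilizes the aligned model
    position: index of the token within the response (later = weaker gradients)
    decay: assumed rate at which the penalty weight grows with position
    """
    weight = 1.0 + decay * position  # later positions receive larger penalties
    mass = sum(math.exp(lp)
               for lp, favored in zip(aligned_logprobs, base_favored_mask)
               if favored)
    return weight * mass

# Toy two-token vocabulary: half the probability mass sits on a base-favored token.
lp = [math.log(0.5), math.log(0.5)]
print(adaptive_penalty(lp, [True, False], position=0))   # 0.5
print(adaptive_penalty(lp, [True, False], position=10))  # 1.0
```

The position-dependent weight is the point: the same residual mass costs more at later positions, pushing the training signal toward exactly the undertrained regions the analysis identifies.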


Key Contributions

  • Mechanistic analysis establishing that position-dependent gradient decay in autoregressive training causes shallow safety alignment — identifying the root cause, not just the symptom
  • Base-favored tokens (vocabulary positions where base model probability exceeds aligned model probability) as fine-grained computational indicators of incomplete distributional alignment across response regions
  • Targeted completion framework using adaptive penalties on base-favored tokens and hybrid teacher distillation, achieving 48–98% reduction in adversarial attack success rates across Llama and Qwen families without expensive retraining
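The base-favored-token indicator from the contributions above reduces to a simple comparison of next-token distributions. A minimal sketch, assuming toy probability vectors and an illustrative `margin` threshold (the paper's exact detection criterion is not reproduced here):

```python
def base_favored_tokens(base_probs, aligned_probs, margin=0.0):
    """Return vocabulary indices where the base model assigns strictly more
    probability than the aligned model (by more than `margin`)."""
    return [i for i, (pb, pa) in enumerate(zip(base_probs, aligned_probs))
            if pb - pa > margin]

# Toy 5-token vocabulary: next-token distributions at one response position.
base    = [0.40, 0.10, 0.25, 0.15, 0.10]
aligned = [0.20, 0.30, 0.25, 0.15, 0.10]

print(base_favored_tokens(base, aligned))  # [0]: only token 0 is base-favored
```

Computed per response position, these indices give the fine-grained map of where distributional alignment is incomplete, which the targeted completion framework then penalizes.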

🛡️ Threat Analysis

Input Manipulation Attack

The defense is explicitly evaluated against gradient-based adversarial suffix attacks such as GCG and AutoDAN, which optimize adversarial suffixes appended to LLM prompts, a core ML01 threat.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time · white_box · black_box
Datasets
AdvBench · JailbreakBench
Applications
llm safety alignment · instruction-following models · chatbot