defense · arXiv · Nov 22, 2025
Thong Bach, Thanh Nguyen-Tang, Dung Nguyen et al. · Deakin University · New Jersey Institute of Technology · Pennsylvania State University
Restores LLM safety alignment after fine-tuning by exploiting shared loss-landscape geometry with curvature-aware second-order optimization
Transfer Learning Attack · Prompt Injection · nlp
Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes with respect to harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs without full reversion, enabling precise, low-impact updates. Extensive evaluations across multiple model families and adversarial settings show that our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
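The core intuition of a curvature-aware update can be illustrated with a toy numpy sketch: precondition gradient ascent on a harmful-content loss by the inverse task-loss Hessian, so the update concentrates in flat (low-influence) directions of the task loss. Everything below (the quadratic task loss, eigenvalue spread, damping value) is an illustrative assumption, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy quadratic stand-in for the fine-tuned task loss, L_task(w) = 0.5 w^T H w.
# The mixed eigenvalues model stiff (task-critical) vs. flat (low-influence) directions.
eigvals = np.array([50.0, 40.0, 30.0, 20.0, 1.0, 0.5, 0.2, 0.1])
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
H = Q @ np.diag(eigvals) @ Q.T  # task-loss Hessian at the fine-tuned optimum

def task_damage(delta):
    """Increase in task loss caused by an update delta away from the optimum."""
    return 0.5 * delta @ H @ delta

# Gradient of a toy "harmful-content" loss whose value we want to *increase*.
g_harm = rng.normal(size=d)

damping = 1e-2  # small damping term for numerical stability (assumed value)

# Curvature-aware step: solve (H + damping*I) delta = g_harm, i.e. precondition
# the ascent direction by the inverse task Hessian.
delta_curv = np.linalg.solve(H + damping * np.eye(d), g_harm)
# Naive step: raw gradient ascent on the harmful loss.
delta_naive = g_harm.copy()

# Normalize both steps to the same (linearized) increase in harmful loss,
# then compare how much each one hurts the task.
delta_curv = delta_curv / (g_harm @ delta_curv)
delta_naive = delta_naive / (g_harm @ delta_naive)

print(f"task damage, curvature-aware: {task_damage(delta_curv):.4f}")
print(f"task damage, naive ascent:    {task_damage(delta_naive):.4f}")
```

Per unit of harmful-loss increase, the preconditioned step incurs strictly less task-loss damage than raw gradient ascent whenever the harmful gradient is not an eigenvector of the task Hessian, which is the geometric point the abstract makes about low-impact updates.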
llm · transformer · Deakin University · New Jersey Institute of Technology · Pennsylvania State University
defense · arXiv · Nov 15, 2025
Thong Bach, Dung Nguyen, Thao Minh Le et al. · Deakin University · Pennsylvania State University
Defends LLMs against jailbreaks by fixing gradient-decay-induced incomplete safety alignment via base-favored token penalties and teacher distillation
Input Manipulation Attack · Prompt Injection · nlp
Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning in which safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens -- vocabulary elements where base models assign higher probability than aligned models -- as computational indicators of incomplete safety learning, and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48--98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding and practical solutions for fundamental limitations in safety alignment methodologies.
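The base-favored-token indicator can be sketched in a few lines of numpy: compare per-position token distributions from a base and an aligned model, mask the tokens the base model rates as more likely, and weight the penalty more heavily at later positions to counter gradient decay. The shapes, random logits, and linear position weights below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
positions, vocab = 6, 12  # toy sequence length and vocabulary size

# Toy per-position logits from a base model and its safety-aligned counterpart;
# the aligned model is modeled as a perturbation of the base model.
base_logits = rng.normal(size=(positions, vocab))
aligned_logits = base_logits + rng.normal(scale=0.5, size=(positions, vocab))

p_base = softmax(base_logits)
p_aligned = softmax(aligned_logits)

# Base-favored tokens: vocabulary items the base model rates as more likely
# than the aligned model -- the paper's indicator of incomplete safety learning.
base_favored = p_base > p_aligned  # boolean mask, shape (positions, vocab)

# Position-adaptive weights: later positions get larger penalties to counter
# the gradient decay of autoregressive training (linear ramp is an assumption).
pos_weight = np.linspace(0.5, 2.0, positions)[:, None]
penalty = (pos_weight * base_favored * (p_base - p_aligned)).sum()

print(f"base-favored fraction: {base_favored.mean():.2f}")
print(f"penalty term: {penalty:.4f}")
```

In training, a term like `penalty` would be added to the safety-alignment loss so that undertrained late-position regions receive extra corrective pressure; the teacher-distillation component of the method is not sketched here.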
llm · transformer