defense · 2025

Curvature-Aware Safety Restoration in LLM Fine-Tuning

Thong Bach 1, Thanh Nguyen-Tang 2, Dung Nguyen 1, Thao Minh Le 3, Truyen Tran 1

1 citation · 36 references · arXiv

Published on arXiv · 2511.18039

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Leveraging shared loss-landscape geometry between base and fine-tuned models, the method selectively penalizes harmful outputs while maintaining or improving task utility across multiple model families and adversarial settings.

Curvature-Aware Alignment Restoration

Novel technique introduced


Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes with respect to harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs without fully reverting the fine-tuned weights, enabling precise, low-impact updates. Extensive evaluations across multiple model families and adversarial settings show that our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
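The core update described in the abstract can be sketched on a toy model. The code below is an illustrative stand-in, not the paper's implementation: a logistic-regression "model" replaces the LLM, and the function names (`curvature_aware_ascent`, `gn_hessian`) are invented for this sketch. The idea it demonstrates is the one the abstract states: precondition a gradient-ascent step on the harmful loss with the (damped) task Hessian, so the update moves along directions the task loss is flat in, raising harmful loss while barely touching task performance.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss(w, X, y):
    # mean binary cross-entropy of a logistic model
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # gradient of the mean binary cross-entropy
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def gn_hessian(w, X, damping=0.1):
    # damped Gauss-Newton Hessian of the logistic loss (positive definite)
    s = sigmoid(X @ w) * (1 - sigmoid(X @ w))
    return (X.T * s) @ X / len(X) + damping * np.eye(X.shape[1])

def curvature_aware_ascent(w, X_task, X_harm, y_harm, eta=0.1):
    # Ascend the harmful loss, preconditioned by the task Hessian:
    # directions with high task curvature are damped, so the update
    # concentrates on axes the task barely uses (illustrative sketch
    # of the paper's influence-function / second-order update).
    H_task = gn_hessian(w, X_task)
    g_harm = grad(w, X_harm, y_harm)
    return w + eta * np.linalg.solve(H_task, g_harm)

# Toy setup: the "task" lives in dims 0-1, the "harmful" signal in dims 2-3.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y_task = (X[:, 0] + X[:, 1] > 0).astype(float)
y_harm = (X[:, 2] + X[:, 3] > 0).astype(float)
w = np.array([2.0, 2.0, 0.0, 0.0])  # a fine-tuned solution fitting only the task

w_new = curvature_aware_ascent(w, X, X, y_harm)
# harmful loss rises; task loss should change only slightly
```

Because the preconditioner is positive definite, the step has positive inner product with the harmful-loss gradient, so the harmful loss strictly increases for a small step, while the task-relevant dimensions are shielded by their higher curvature.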


Key Contributions

  • Empirical discovery that fine-tuned LLMs preserve the geometric structure of their loss landscapes for harmful content, indicating safety behaviors are shifted to less influential parameter regions rather than erased
  • Curvature-aware alignment restoration method using influence functions and second-order optimization to selectively increase loss on harmful inputs without reverting task-specific learning
  • Extensive evaluation across multiple LLM families and adversarial settings demonstrating reduced harmful outputs while preserving or improving utility and few-shot performance

🛡️ Threat Analysis

Transfer Learning Attack

The paper's primary threat model is safety alignment degradation that occurs specifically during the fine-tuning/transfer learning process (including LoRA). The defense targets the geometric gap between pre-trained and fine-tuned model parameter spaces, making this squarely a transfer learning attack defense rather than a general alignment issue.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · training_time
Applications
llm fine-tuning · safety alignment · parameter-efficient fine-tuning