Defense · 2026

Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang 1, Lipeng He 1, Kejia Chen 2, Jian Lou 1, Jian Liu 1, Xiaohu Yang 3, Ruoxi Jia 4

1 citation · 66 references · arXiv

Published on arXiv · 2601.01887

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Safety alignment is fully recoverable with a single safety example within a few epochs, without degrading model utility, regardless of how many harmful examples were used in fine-tuning.


Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also cause noticeable degradation in model utility. Contrary to this prevailing assumption, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
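To make the "single example, few epochs" claim concrete, here is a minimal toy sketch (not the paper's implementation): a tiny linear model stands in for a harmfully fine-tuned LLM, and we run a handful of gradient steps on one hypothetical safety example, watching the safety loss converge. All names (`W`, `safety_x`, `safety_y`) and the squared-error surrogate loss are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for a harmfully fine-tuned model: a random linear "policy".
rng = np.random.default_rng(0)
dim = 16
W = rng.normal(size=(dim,))

# The SINGLE safety example (illustrative features and a desired "refuse" target).
safety_x = rng.normal(size=(dim,))
safety_y = 1.0

def safety_loss(W):
    # Squared-error surrogate for the safety objective (an assumption,
    # not the paper's actual loss).
    return 0.5 * (W @ safety_x - safety_y) ** 2

lr = 0.01
losses = [safety_loss(W)]
for epoch in range(5):  # "a few epochs", as claimed in the abstract
    grad = (W @ safety_x - safety_y) * safety_x  # gradient of the surrogate
    W = W - lr * grad
    losses.append(safety_loss(W))

print(f"safety loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Even in this toy setting, a few steps on a single example drive the safety loss down monotonically, which is the qualitative behavior the paper reports at LLM scale.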


Key Contributions

  • Demonstrates that LLM safety alignment can be fully recovered using only a single safety example, regardless of the number of harmful fine-tuning examples or model size
  • Uncovers the low-rank structure of the safety gradient, providing a theoretical explanation for why one-shot realignment is effective
  • Validates the approach across five safety-aligned LLMs and multiple datasets with convergence in just a few epochs
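The low-rank claim has a simple intuition that can be illustrated numerically (this is an illustration, not the paper's analysis): for a linear layer, the gradient produced by a single example is an outer product of the output-side error signal and the input activation, and is therefore rank one. The sketch below checks this with an SVD; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 32, 64
err = rng.normal(size=(d_out,))  # output-side error signal for one example
x = rng.normal(size=(d_in,))     # that example's input activation

# Gradient of a linear layer's weight matrix for a single example:
# an outer product, hence rank one.
G = np.outer(err, x)

s = np.linalg.svd(G, compute_uv=False)
effective_rank = int(np.sum(s > 1e-8 * s[0]))
print("effective rank:", effective_rank)  # prints: effective rank: 1
```

A rank-one (or more generally low-rank) update direction is cheap to follow and touches only a small subspace of the weights, which is consistent with the paper's explanation of why one safety example suffices without degrading utility.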

🛡️ Threat Analysis

Transfer Learning Attack

The threat model is explicitly fine-tuning (transfer learning) on harmful data that degrades safety alignment; the paper proposes a defense that exploits low-rank safety gradient structure to recover alignment, directly addressing the fine-tuning attack vector.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time · white_box
Applications
safety-aligned llm fine-tuning · llm safety realignment