defense 2026

Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink

Guozhi Liu 1, Weiwei Lin 1,2, Tiansheng Huang 1, Ruichao Mo 1, Qi Mu 1, Xiumin Wang 1, Li Shen 3

0 citations · 165 references · arXiv (Cornell University)

Published on arXiv

2602.05228

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

Surgery improves defense performance against harmful fine-tuning by up to 11.25% over prior baselines on HarmBench by leveraging attention sink divergence suppression.

Surgery

Novel technique introduced


Harmful fine-tuning can invalidate the safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named \emph{sink divergence} for each attention head and observe that \emph{different attention heads exhibit two different signs of sink divergence}. To understand its safety implications, we conduct experiments and find that the number of attention heads with positive sink divergence increases as the model's harmfulness grows under harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis -- \emph{attention heads associated with learning harmful patterns during fine-tuning are separable by their sign of sink divergence}. Building on this hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model's tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90\%, 11.25\%, and 9.55\% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available at https://github.com/Lslland/Surgery.
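The abstract does not reproduce the paper's exact formula for sink divergence, but the idea can be illustrated with a minimal sketch: for each head, compare the average attention mass placed on the sink token (here assumed to be the first token) against a uniform-attention baseline, so the sign of the statistic indicates whether a head over- or under-attends to the sink. The function name and the uniform baseline are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def sink_divergence(attn, baseline=None):
    """Hypothetical per-head sink-divergence statistic (illustration only).

    attn: array of shape (num_heads, seq_len, seq_len), where
    attn[h, i, j] is head h's attention weight from query i to key j.
    The attention sink is taken to be the first token (j = 0).
    Returns one scalar per head: mean attention on the sink token minus
    a uniform-attention baseline, so positive values mean the head
    over-attends to the sink.
    """
    num_heads, seq_len, _ = attn.shape
    sink_mass = attn[:, :, 0].mean(axis=1)   # mean attention to token 0 per head
    if baseline is None:
        baseline = 1.0 / seq_len             # uniform-attention reference
    return sink_mass - baseline

# Toy example: 2 heads over a 4-token sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 4))
logits[0, :, 0] += 3.0                       # head 0 strongly attends to the sink
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
div = sink_divergence(attn)
print(div)                                   # head 0 is clearly positive
```

In practice the attention tensor would be captured from a real transformer (e.g. via forward hooks) rather than sampled randomly; the toy logits just make the two-sign behavior visible.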


Key Contributions

  • Discovery that the number of attention heads with positive sink divergence grows with model harmfulness during fine-tuning, motivating the 'separable sink divergence hypothesis'
  • Surgery: a fine-tuning-stage regularizer that steers attention heads toward the negative sink divergence group to prevent harmful pattern acquisition
  • Demonstrated improvements of 5.90%, 11.25%, and 9.55% over baselines on BeaverTails, HarmBench, and SorryBench benchmarks respectively
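The regularizer in the second contribution can be sketched as a penalty added to the ordinary fine-tuning loss: heads with positive sink divergence are pushed toward the negative group by penalizing the positive part of each head's statistic. The function name, the hinge form `max(d, 0)`, and the weight `lam` are assumptions for illustration; the paper's actual regularizer may differ.

```python
import numpy as np

def sink_suppression_penalty(sink_divs, lam=0.1):
    """Hypothetical sink-divergence suppression term (illustration only).

    sink_divs: per-head sink-divergence values. Only positive divergence
    is penalized, steering those heads toward the negative group;
    lam weights the penalty against the task loss.
    """
    return lam * np.maximum(sink_divs, 0.0).sum()

# During fine-tuning the total objective would then be (sketch):
#   total_loss = task_loss + sink_suppression_penalty(per_head_sink_divergence)
divs = np.array([0.3, -0.2, 0.1, -0.4])
print(sink_suppression_penalty(divs))  # only the two positive heads contribute
```

Only the hinge on the positive side matters here: heads already in the negative sink-divergence group contribute nothing, which matches the stated goal of steering heads toward that group rather than shrinking all divergences toward zero.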

🛡️ Threat Analysis

Transfer Learning Attack

The paper's core threat is harmful fine-tuning that exploits the transfer learning/fine-tuning process to invalidate LLM safety alignment — a canonical ML07 scenario. The defense (Surgery) operates at fine-tuning stage, leveraging attention sink dynamics to prevent the model from learning harmful patterns introduced via user-uploaded training data.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Datasets
BeaverTails, HarmBench, SorryBench
Applications
llm fine-tuning services, llm safety alignment