attack 2025

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Bingqi Shang ¹, Yiwei Chen ¹, Yihua Zhang ¹, Bingquan Shen ², Sijia Liu ^1,3

¹ Michigan State University

² National University of Singapore

³ IBM Research

1 citations · 58 references · arXiv

Published on arXiv

2510.17021

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Attention-sink-guided trigger placement reliably restores forgotten knowledge in unlearned LLMs when the trigger is present, while the model behaves indistinguishably from a legitimately unlearned model in clean settings.

Attention-Sink-Guided Backdoor Unlearning

Novel technique introduced

Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent. Code is available at https://github.com/OPTML-Group/Unlearn-Backdoor.

Key Contributions

First systematic investigation of backdoor unlearning attacks on LLMs, demonstrating that the unlearning process can be subverted to appear successful while covertly retaining forbidden knowledge under trigger activation
Discovers a mechanistic link between the attention sink phenomenon and backdoor persistence: placing triggers at sink positions and aligning their attention values markedly improves attack reliability
Proposes an attention-sink-guided backdoor unlearning method that is indistinguishable from normal unlearning without the trigger and reliably restores forgotten knowledge with it

🛡️ Threat Analysis

Model Poisoning

Core contribution is a backdoor/trojan attack: the model behaves like a normally unlearned LLM in clean settings but recovers suppressed knowledge when a hidden trigger is present. The attack exploits attention sink positions during the unlearning (fine-tuning) phase to ensure trigger-activated behavior persists — a textbook ML10 scenario with a trigger-conditional hidden behavior.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

training_timeinference_timetargeted

Applications

llm unlearningknowledge removalmachine unlearning

Read PDF arXiv DOI Code

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

SASER: Stego attacks on open-source LLMs

Localizing Malicious Outputs from CodeLLM

TFL: Targeted Bit-Flip Attack on Large Language Model

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Semantically-Equivalent Transformations-Based Backdoor Attacks against Neural Code Models: Characterization and Mitigation

Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models