defense 2025

Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Liang Lin 1, Miao Yu 2, Moayad Aloqaily 3, Zhenhong Zhou 1, Kun Wang 1, Linsey Pang 4, Prakhar Mehrotra 5, Qingsong Wen 6

0 citations · 38 references · arXiv


Published on arXiv

2510.10265

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Reduces average Attack Success Rate to 4.41% across multiple benchmarks — 28.1%–69.3% better than existing baselines — with negligible clean accuracy degradation (<0.5%)

Backdoor Collapse

Novel technique introduced


Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose Backdoor Collapse, a defense framework that requires no prior knowledge of trigger settings. Backdoor Collapse is based on the key observation that deliberately injecting known backdoors into an already-compromised model causes both the existing unknown backdoors and the newly injected ones to aggregate in representation space. Backdoor Collapse leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) Backdoor Collapse reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1%–69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.


Key Contributions

  • Key observation that injecting known backdoors into a compromised model causes both known and unknown backdoor representations to aggregate in the representation space
  • Two-stage defense: (1) backdoor representation aggregation via deliberate known-trigger injection, (2) recovery fine-tuning to restore benign behavior without prior knowledge of trigger settings
  • Reduces average Attack Success Rate to 4.41% across multiple LLM architectures and benchmarks, outperforming baselines by 28.1%–69.3% while preserving clean accuracy within 0.5%
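The two-stage intuition above can be illustrated with a toy numeric sketch. This is a hypothetical NumPy simulation, not the paper's implementation: backdoor representations are modeled as 2-D points, the known-trigger injection of Stage 1 is simulated as a pull toward a shared cluster centroid, and Stage 2's recovery fine-tuning is simulated by remapping that collapsed region to the benign cluster. All numbers, names, and the ASR proxy are illustrative assumptions.

```python
import numpy as np

# Toy simulation of the two-stage "Backdoor Collapse" idea (illustrative only).
rng = np.random.default_rng(0)

# A compromised model: an UNKNOWN backdoor occupies some cluster of
# representation space; benign inputs occupy another.
unknown_backdoor_reps = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
benign_reps = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))

# Stage 1 (aggregation): deliberately inject a KNOWN trigger. The paper's key
# observation is that continued poisoning pulls all backdoor representations
# together; here we simulate that pull toward a common centroid.
known_backdoor_reps = rng.normal(loc=[3.2, 2.8], scale=0.3, size=(50, 2))
centroid = np.vstack([unknown_backdoor_reps, known_backdoor_reps]).mean(axis=0)

def aggregate(reps, step=0.9):
    # Move each representation most of the way toward the shared centroid.
    return reps + step * (centroid - reps)

unknown_agg = aggregate(unknown_backdoor_reps)

# Stage 2 (recovery fine-tuning): because the backdoors now share one region,
# remapping that region to benign behavior also neutralizes the unknown one.
def recover(rep, radius=1.0):
    if np.linalg.norm(rep - centroid) < radius:
        return benign_reps.mean(axis=0)  # stand-in for fine-tuned benign output
    return rep

recovered = np.array([recover(r) for r in unknown_agg])

# ASR proxy: fraction of unknown-backdoor inputs still in the backdoor region.
asr = np.mean([np.linalg.norm(r - centroid) < 1.0 for r in recovered])
print(f"toy ASR after defense: {asr:.2f}")
```

The design point the toy makes explicit: the defender never needs the unknown trigger itself, only the fact that its representations co-locate with those of the deliberately injected known trigger.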

🛡️ Threat Analysis

Model Poisoning

Primary contribution is a defense against backdoor/trojan attacks in LLMs — reduces Attack Success Rate to 4.41% by exploiting representation aggregation of known and unknown backdoors followed by recovery fine-tuning.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time · targeted · black_box
Datasets
multiple LLM benchmarks (unspecified in available text)
Applications
large language models · text classification · instruction-following models