defense 2025

Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

Chen Chen 1, Yuchen Sun 2, Jiaxin Gao 2, Xueluan Gong 1, Qian Wang 2, Ziyao Wang 1, Yongsen Zheng 1, Kwok-Yan Lam 1


Published on arXiv: 2508.21004

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% across 5 LLMs while remaining cost-efficient and robust against adaptive backdoor attacks

LETHE

Novel technique introduced


Large language models (LLMs) have seen significant advancements, achieving superior performance on various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally on standard queries but generate harmful responses or unintended outputs when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios such as model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method for eliminating backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor's impact within the model's parametric memory. Externally, LETHE incorporates benign, semantically relevant evidence into the prompt to distract the LLM's attention from backdoor features. Experimental results in both classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE has proven to be cost-efficient and robust against adaptive backdoor attacks.


Key Contributions

  • Internal mechanism: trains a clean model on a lightweight dataset and merges it with the backdoored model to dilute malicious parametric memory without costly full fine-tuning
  • External mechanism: injects benign, semantically relevant evidence into prompts at inference time to distract LLM attention away from backdoor trigger features
  • Reduces attack success rate by up to 98% across 5 LLMs and 8 backdoor attack types while outperforming 8 state-of-the-art defense baselines and maintaining model utility
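The internal mechanism boils down to merging the clean model's parameters into the backdoored model's. A minimal sketch of one plausible merging scheme, simple linear weight interpolation, is shown below; the paper's actual merging rule and the `alpha` mixing coefficient are assumptions for illustration, and toy scalar weights stand in for real tensors:

```python
def dilute(backdoored, clean, alpha=0.5):
    """Merge two models (name -> weight dicts) by linear interpolation.

    `alpha` (hypothetical parameter name) controls how strongly the
    clean model dilutes the backdoored one: 0 keeps the backdoored
    weights unchanged, 1 replaces them entirely with the clean weights.
    """
    return {
        name: (1 - alpha) * backdoored[name] + alpha * clean[name]
        for name in backdoored
    }

# Toy one-parameter "models": the backdoored weight is pulled toward
# the clean weight, diluting whatever the trigger encoded in it.
backdoored = {"w": 2.0}
clean = {"w": 0.0}
merged = dilute(backdoored, clean, alpha=0.5)  # {"w": 1.0}
```

In practice the same interpolation would be applied tensor-by-tensor over a state dict; the key property is that it is far cheaper than full fine-tuning, since only a lightweight clean model must be trained.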

🛡️ Threat Analysis

Model Poisoning

Primary contribution is a backdoor purification defense for LLMs, targeting trigger-activated malicious behaviors across diverse attack types (model-editing-based, multi-trigger, triggerless) through internal model merging and external prompt intervention.
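The external mechanism is a prompt-level intervention: benign, semantically relevant evidence is placed alongside the user query so the model's attention is drawn to it rather than to any trigger tokens. A minimal sketch, with a hypothetical prompt template (the paper's actual prompt construction and evidence-retrieval step are not specified here):

```python
def augment_prompt(query, evidence):
    """Prepend benign, semantically relevant evidence to the query.

    `evidence` is a list of strings assumed to come from some benign
    retrieval source; the "Evidence:"/"Query:" template is illustrative.
    """
    context = "\n".join(f"Evidence: {e}" for e in evidence)
    return f"{context}\nQuery: {query}"

prompt = augment_prompt(
    "Is this review positive?",
    ["The film received warm critical reception."],
)
```

Because this happens purely at inference time, it composes with the internal merging step and requires no access to the model's weights.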


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, white_box, targeted
Applications
text classification, text generation, chatbot