
Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

Chen Chen 1, Yuchen Sun 2, Jiaxin Gao 2, Yanwen Jia 2, Xueluan Gong 1, Qian Wang 2, Kwok-Yan Lam 1

0 citations · 53 references · arXiv (Cornell University)


Published on arXiv

2602.06887

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

PROTOPURIFY reduces the backdoor attack success rate (ASR) to below 10% (as low as 1.6%) while incurring less than a 3% drop in clean utility, outperforming 6 representative defenses across 6 diverse backdoor attack types.

PROTOPURIFY

Novel technique introduced


Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet remain vulnerable to backdoor attacks. Existing backdoor defenses, however, are difficult to operationalize for Backdoor Defense-as-a-Service (BDaaS): they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task domain specifics) and lack reusable, scalable purification across diverse backdoored models. In this paper, we present PROTOPURIFY, a backdoor purification framework via parameter edits under minimal assumptions. PROTOPURIFY first builds a backdoor vector pool from clean and backdoored model pairs, aggregates vectors into candidate prototypes, and selects the most aligned candidate for the target model via similarity matching. PROTOPURIFY then identifies a boundary layer through layer-wise prototype alignment and performs targeted purification by suppressing prototype-aligned components in the affected layers, achieving fine-grained mitigation with minimal impact on benign utility. Designed as a BDaaS-ready primitive, PROTOPURIFY supports reusability, customizability, interpretability, and runtime efficiency. Experiments across various LLMs on both classification and generation tasks show that PROTOPURIFY consistently outperforms 6 representative defenses against 6 diverse attacks, including single-trigger, multi-trigger, and triggerless backdoor settings. PROTOPURIFY reduces the attack success rate (ASR) to below 10%, and even as low as 1.6% in some cases, while incurring less than a 3% drop in clean utility. It further demonstrates robustness against adaptive backdoor variants and stability on non-backdoored models.
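The pool-and-match pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weight-delta formulation of a backdoor vector, the group-averaging aggregation, and the cosine-similarity matching rule are all assumptions made for the sketch.

```python
import numpy as np

def backdoor_vector(clean_w: np.ndarray, backdoored_w: np.ndarray) -> np.ndarray:
    """Backdoor vector as the normalized weight delta between a
    backdoored model and its clean counterpart (one plausible
    extraction; the paper's exact formulation may differ)."""
    v = (backdoored_w - clean_w).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def aggregate_prototypes(vectors: list[np.ndarray], n_groups: int) -> list[np.ndarray]:
    """Aggregate pooled backdoor vectors into candidate prototypes
    via simple group-wise averaging (a stand-in for whatever
    clustering/aggregation the paper uses)."""
    groups = np.array_split(np.stack(vectors), n_groups)
    protos = [g.mean(axis=0) for g in groups]
    return [p / (np.linalg.norm(p) + 1e-12) for p in protos]

def select_prototype(target_delta: np.ndarray, prototypes: list[np.ndarray]) -> int:
    """Select the candidate prototype most aligned with the target
    model's update direction via cosine similarity."""
    t = target_delta.ravel()
    t = t / (np.linalg.norm(t) + 1e-12)
    sims = [float(t @ p) for p in prototypes]
    return int(np.argmax(sims))
```

The key property this sketch captures is reusability: the pool and prototypes are computed once, then matched against any new target model without downstream clean data or trigger knowledge.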


Key Contributions

  • PROTOPURIFY: a backdoor purification framework that builds a reusable backdoor vector pool from clean/backdoored model pairs, aggregates them into prototypes, and applies targeted layer-wise parameter suppression — requiring no downstream clean data, known triggers, or task-specific information.
  • BDaaS-ready design supporting reusability, customizability, interpretability, and runtime efficiency, enabling scalable deployment as a managed backdoor defense service.
  • Outperforms 6 representative defenses against 6 diverse attack types (single-trigger, multi-trigger, triggerless), reducing ASR to as low as 1.6% with under 3% clean utility degradation.
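The targeted layer-wise suppression named in the first contribution might look like the following sketch. The projection form W' = W − α·⟨W, p⟩p, the `alpha` strength parameter, the boundary-layer gating, and the `"layer.N.weight"` naming scheme are all hypothetical stand-ins for the paper's actual parameter edits.

```python
import numpy as np

def purify_layer(weight: np.ndarray, prototype: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Suppress the prototype-aligned component of one layer's weights:
    W' = W - alpha * <W, p> p, with p the unit-norm prototype slice for
    this layer. alpha = 1 removes the aligned component entirely."""
    p = prototype.ravel() / (np.linalg.norm(prototype) + 1e-12)
    w = weight.ravel()
    purified = w - alpha * (w @ p) * p
    return purified.reshape(weight.shape)

def purify_model(weights: dict[str, np.ndarray],
                 prototypes: dict[str, np.ndarray],
                 boundary_layer: int) -> dict[str, np.ndarray]:
    """Apply purification only from the boundary layer onward, leaving
    earlier layers untouched to preserve benign utility."""
    out = {}
    for name, w in weights.items():
        layer_idx = int(name.split(".")[1])  # assumes names like "layer.3.weight"
        if layer_idx >= boundary_layer and name in prototypes:
            out[name] = purify_layer(w, prototypes[name])
        else:
            out[name] = w
    return out
```

Gating by the boundary layer is what makes the mitigation fine-grained: only layers whose updates align with the backdoor prototype are edited.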

🛡️ Threat Analysis

Model Poisoning

Primary contribution is a defense against backdoor/trojan attacks in LLMs — PROTOPURIFY purifies backdoored models by suppressing prototype-aligned components in affected layers, targeting single-trigger, multi-trigger, and triggerless backdoor variants, and reducing ASR to below 10%.
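Boundary-layer identification via layer-wise prototype alignment, as described above, could be approximated as below. The cosine-similarity score, the fixed threshold, and the first-crossing rule are illustrative assumptions, not the paper's stated criterion.

```python
import numpy as np

def boundary_layer(layer_deltas: list[np.ndarray],
                   proto_slices: list[np.ndarray],
                   threshold: float = 0.5) -> int:
    """Return the index of the first layer whose update direction is
    strongly aligned (cosine >= threshold) with the selected prototype.
    If no layer crosses the threshold, return the layer count, i.e.
    no layer is flagged for purification."""
    for i, (d, p) in enumerate(zip(layer_deltas, proto_slices)):
        dv, pv = d.ravel(), p.ravel()
        cos = float(dv @ pv) / (np.linalg.norm(dv) * np.linalg.norm(pv) + 1e-12)
        if cos >= threshold:
            return i
    return len(layer_deltas)
```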


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Applications
text classification, text generation, llm security services