
Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

Chen Chen 1, Yuchen Sun 2, Jiaxin Gao 2, Yanwen Jia 2, Xueluan Gong 1, Qian Wang 2, Kwok-Yan Lam 1

0 citations · 53 references · arXiv (Cornell University)


Published on arXiv

2602.06887

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

PROTOPURIFY reduces the backdoor attack success rate (ASR) to below 10% (as low as 1.6%) while incurring less than a 3% drop in clean utility, outperforming 6 representative defenses across 6 diverse backdoor attack types.

PROTOPURIFY

Novel technique introduced


Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet remain vulnerable to backdoor attacks. Existing backdoor defenses, however, are difficult to operationalize for Backdoor Defense-as-a-Service (BDaaS): they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task domain specifics) and lack reusable, scalable purification across diverse backdoored models. In this paper, we present PROTOPURIFY, a backdoor purification framework via parameter edits under minimal assumptions. PROTOPURIFY first builds a backdoor vector pool from clean and backdoored model pairs, aggregates vectors into candidate prototypes, and selects the most aligned candidate for the target model via similarity matching. PROTOPURIFY then identifies a boundary layer through layer-wise prototype alignment and performs targeted purification by suppressing prototype-aligned components in the affected layers, achieving fine-grained mitigation with minimal impact on benign utility. Designed as a BDaaS-ready primitive, PROTOPURIFY supports reusability, customizability, interpretability, and runtime efficiency. Experiments across various LLMs on both classification and generation tasks show that PROTOPURIFY consistently outperforms 6 representative defenses against 6 diverse attacks, including single-trigger, multi-trigger, and triggerless backdoor settings. PROTOPURIFY reduces the attack success rate (ASR) to below 10%, and even as low as 1.6% in some cases, while incurring less than a 3% drop in clean utility. It further demonstrates robustness against adaptive backdoor variants and stability on non-backdoored models.
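The pool-and-match pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weight-delta formulation of a backdoor vector, the group-averaging aggregation, and the cosine-similarity matching rule are all assumptions made for the sketch.

```python
import numpy as np

def backdoor_vector(clean_w: np.ndarray, backdoored_w: np.ndarray) -> np.ndarray:
    """Backdoor vector as the normalized weight delta between a
    backdoored model and its clean counterpart (one plausible
    extraction; the paper's exact formulation may differ)."""
    v = (backdoored_w - clean_w).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def aggregate_prototypes(vectors: list[np.ndarray], n_groups: int) -> list[np.ndarray]:
    """Aggregate pooled backdoor vectors into candidate prototypes
    via simple group-wise averaging (a stand-in for whatever
    clustering/aggregation the paper uses)."""
    groups = np.array_split(np.stack(vectors), n_groups)
    protos = [g.mean(axis=0) for g in groups]
    return [p / (np.linalg.norm(p) + 1e-12) for p in protos]

def select_prototype(target_delta: np.ndarray, prototypes: list[np.ndarray]) -> int:
    """Select the candidate prototype most aligned with the target
    model's update direction via cosine similarity."""
    t = target_delta.ravel()
    t = t / (np.linalg.norm(t) + 1e-12)
    sims = [float(t @ p) for p in prototypes]
    return int(np.argmax(sims))
```

The key property this sketch captures is reusability: the pool and prototypes are computed once, then matched against any new target model without downstream clean data or trigger knowledge.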


Key Contributions

  • PROTOPURIFY: a backdoor purification framework that builds a reusable backdoor vector pool from clean/backdoored model pairs, aggregates them into prototypes, and applies targeted layer-wise parameter suppression — requiring no downstream clean data, known triggers, or task-specific information.
  • BDaaS-ready design supporting reusability, customizability, interpretability, and runtime efficiency, enabling scalable deployment as a managed backdoor defense service.
  • Outperforms 6 representative defenses against 6 diverse attack types (single-trigger, multi-trigger, triggerless), reducing ASR to as low as 1.6% with under 3% clean utility degradation.
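The targeted layer-wise suppression named in the first contribution might look like the following sketch. The projection form W' = W − α·⟨W, p⟩p, the `alpha` strength parameter, the boundary-layer gating, and the `"layer.N.weight"` naming scheme are all hypothetical stand-ins for the paper's actual parameter edits.

```python
import numpy as np

def purify_layer(weight: np.ndarray, prototype: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Suppress the prototype-aligned component of one layer's weights:
    W' = W - alpha * <W, p> p, with p the unit-norm prototype slice for
    this layer. alpha = 1 removes the aligned component entirely."""
    p = prototype.ravel() / (np.linalg.norm(prototype) + 1e-12)
    w = weight.ravel()
    purified = w - alpha * (w @ p) * p
    return purified.reshape(weight.shape)

def purify_model(weights: dict[str, np.ndarray],
                 prototypes: dict[str, np.ndarray],
                 boundary_layer: int) -> dict[str, np.ndarray]:
    """Apply purification only from the boundary layer onward, leaving
    earlier layers untouched to preserve benign utility."""
    out = {}
    for name, w in weights.items():
        layer_idx = int(name.split(".")[1])  # assumes names like "layer.3.weight"
        if layer_idx >= boundary_layer and name in prototypes:
            out[name] = purify_layer(w, prototypes[name])
        else:
            out[name] = w
    return out
```

Gating by the boundary layer is what makes the mitigation fine-grained: only layers whose updates align with the backdoor prototype are edited.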

🛡️ Threat Analysis

Model Poisoning

Primary contribution is a defense against backdoor/trojan attacks in LLMs — PROTOPURIFY purifies backdoored models by suppressing prototype-aligned components in affected layers, targeting single-trigger, multi-trigger, and triggerless backdoor variants, and reducing ASR to below 10%.
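Boundary-layer identification via layer-wise prototype alignment, as described above, could be approximated as below. The cosine-similarity score, the fixed threshold, and the first-crossing rule are illustrative assumptions, not the paper's stated criterion.

```python
import numpy as np

def boundary_layer(layer_deltas: list[np.ndarray],
                   proto_slices: list[np.ndarray],
                   threshold: float = 0.5) -> int:
    """Return the index of the first layer whose update direction is
    strongly aligned (cosine >= threshold) with the selected prototype.
    If no layer crosses the threshold, return the layer count, i.e.
    no layer is flagged for purification."""
    for i, (d, p) in enumerate(zip(layer_deltas, proto_slices)):
        dv, pv = d.ravel(), p.ravel()
        cos = float(dv @ pv) / (np.linalg.norm(dv) * np.linalg.norm(pv) + 1e-12)
        if cos >= threshold:
            return i
    return len(layer_deltas)
```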


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Applications
text classification, text generation, llm security services