defense 2025

ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models

Zihan Wang 1, Rui Zhang 1, Hongwei Li 1, Wenshu Fan 1, Wenbo Jiang 1, Qingchuan Zhao 2, Guowen Xu 1


Published on arXiv (arXiv:2508.01365)

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Achieves near 100% true positive rate with negligible false positive rate across the vast majority of backdoor attack scenarios with minimal additional latency

ConfGuard

Novel technique introduced


Backdoor attacks pose a significant threat to Large Language Models (LLMs): adversaries can embed hidden triggers that manipulate an LLM's outputs. Most existing defense methods, designed primarily for classification tasks, are ineffective against the autoregressive nature and vast output space of LLMs, and consequently suffer from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in the output space. We identify a critical phenomenon, which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate that ConfGuard achieves a near-100% true positive rate (TPR) and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, ConfGuard enables real-time detection with almost no additional latency, making it a practical backdoor defense for real-world LLM deployments.


Key Contributions

  • Identifies the 'sequence lock' phenomenon: backdoored LLMs generate target sequences with abnormally high and consistent token-level confidence compared to benign generation
  • Proposes ConfGuard, a sliding-window token confidence monitor that detects sequence lock with near 100% TPR and negligible FPR
  • Enables real-time backdoor detection with almost no additional inference latency, making it practical for production LLM deployments
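The sliding-window idea above can be sketched as a small streaming monitor. The window size, confidence threshold, and variance threshold below are illustrative assumptions, not the paper's tuned values, and the statistic (mean and variance of per-token probabilities) is one plausible way to operationalize "abnormally high and consistent confidence":

```python
from collections import deque

def make_confguard_monitor(window_size=8, conf_threshold=0.95, var_threshold=1e-3):
    """Return a streaming callback that flags 'sequence lock': a run of
    tokens emitted with abnormally high and consistent confidence.
    Hyperparameters here are illustrative, not the paper's values."""
    window = deque(maxlen=window_size)

    def observe(token_prob):
        # token_prob: probability the model assigned to the token it just emitted
        window.append(token_prob)
        if len(window) < window_size:
            return False  # not enough tokens yet to judge
        mean = sum(window) / window_size
        var = sum((p - mean) ** 2 for p in window) / window_size
        # Sequence lock signature: high mean confidence AND low variance
        return mean >= conf_threshold and var <= var_threshold

    return observe
```

In deployment this would be called once per decoded token (the per-token probability is available from the decoder's softmax output at no extra cost), so detection adds essentially no inference latency: a flagged window can abort generation mid-sequence.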

🛡️ Threat Analysis

Model Poisoning

ConfGuard is explicitly a defense against backdoor/trojan attacks on LLMs. It detects the signature of hidden malicious behavior (sequence lock) that activates when a specific trigger is present, which is the canonical ML10 threat model.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted
Applications
llm deployment, text generation, llm agent systems