defense 2026

Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

Kaisheng Fan ¹,², Weizhe Zhang ¹,², Yishu Gao ¹, Tegawendé F. Bissyandé ³, Xunzhu Tang ³

¹ Harbin Institute of Technology

² Peng Cheng Laboratory

³ University of Luxembourg

0 citations

Published on arXiv

2604.24162

Model Poisoning

OWASP ML Top 10 — ML10

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

TIGS substantially suppresses backdoor attack success rates while preserving clean reasoning and semantic consistency across dense, reasoning-oriented, and mixture-of-experts models, with only marginal latency overhead.

TIGS (Tail-risk Intrinsic Geometric Smoothing)

Novel technique introduced

Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain correction preserves semantic anchoring, while a stronger full-row contraction disrupts trigger-dominant routing. Finally, a controlled full-row write-back reconstructs the attention matrix to ensure inference stability. Extensive evaluations demonstrate that TIGS substantially suppresses attack success rates while strictly preserving clean reasoning and open-ended semantic consistency. Crucially, this favorable security-utility-latency equilibrium persists across diverse architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models. By structurally disrupting adversarial routing with marginal latency overhead, TIGS establishes a highly practical, deployment-ready defense standard for state-of-the-art LLMs.
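The screening step described in the abstract can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's formulation: the collapse score (largest single-key share of content-region attention), the quantile `tail_q`, the absolute floor `min_peak`, and the function name `screen_rows` are all hypothetical choices made for this example. The only property carried over from the abstract is that the threshold comes from the sample itself rather than from external clean data.

```python
import numpy as np

def screen_rows(attn, content_mask, tail_q=0.95, min_peak=0.5):
    """Flag (head, row) pairs whose content-region attention looks collapsed.

    attn:         (heads, seq, seq) row-stochastic attention weights.
    content_mask: (seq,) boolean, True for semantic-content key positions.
    tail_q:       hypothetical quantile that defines the tail within this sample.
    min_peak:     hypothetical absolute floor, so clean samples flag nothing.
    """
    # Keep only the attention that lands on content keys.
    content = attn * content_mask[None, None, :]
    mass = content.sum(axis=-1, keepdims=True)
    # Renormalise inside the content region; guard rows with no content mass.
    p = content / np.maximum(mass, 1e-12)
    # Assumed collapse score: largest single-key share of content attention.
    peak = p.max(axis=-1)                    # shape (heads, seq)
    # Sample-internal threshold: no clean reference data is consulted.
    threshold = np.quantile(peak, tail_q)
    flags = (peak > threshold) & (peak > min_peak)
    return flags, peak
```

A quantile alone would always flag the top few rows, even on clean input, which is why this sketch pairs it with an absolute floor. How the paper itself separates benign peaks from trigger-induced collapse is not stated on this page.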

Key Contributions

Tail-Risk Intrinsic Geometric Smoothing (TIGS): plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation
Content-aware tail-risk screening that identifies suspicious attention heads and rows using sample-internal signals to detect trigger-induced attention collapse
Dual-scale intrinsic geometric smoothing with weak content-domain correction for semantic preservation and stronger full-row contraction to disrupt trigger-dominant routing
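The dual-scale smoothing and write-back named in the contributions above might look roughly like the following NumPy sketch. Everything specific here is an assumption for illustration: the blend weights `alpha` (weak) and `beta` (strong), the choice of the content mean and the uniform distribution as contraction targets, the optional `valid` mask for causal attention, and the function name `smooth_and_write_back`. The paper's actual operators may differ.

```python
import numpy as np

def smooth_and_write_back(attn, flags, content_mask, valid=None,
                          alpha=0.1, beta=0.5):
    """Smooth flagged attention rows and write them back as valid distributions.

    attn:         (heads, seq, seq) row-stochastic attention weights.
    flags:        (heads, seq) boolean, True for rows selected by screening.
    content_mask: (seq,) boolean, True for semantic-content key positions.
    valid:        (seq, seq) boolean of attendable keys per query (e.g. causal).
    alpha, beta:  hypothetical weak and strong blend weights.
    """
    heads, seq, _ = attn.shape
    if valid is None:
        valid = np.ones((seq, seq), dtype=bool)
    out = attn.copy()
    for h, i in zip(*np.nonzero(flags)):
        row = attn[h, i].copy()
        v = valid[i]
        c = content_mask & v
        # Weak content-domain correction: pull content weights slightly toward
        # their mean. Total content mass is unchanged by this step.
        if c.any():
            row[c] = (1 - alpha) * row[c] + alpha * row[c].mean()
        # Stronger full-row contraction toward uniform over attendable keys.
        row = (1 - beta) * row + beta * (v / v.sum())
        row[~v] = 0.0
        # Controlled write-back: renormalise so the row still sums to one.
        out[h, i] = row / row.sum()
    return out
```

Because both steps are convex blends, the ordering of keys inside a flagged row is preserved while the dominant peak shrinks, and unflagged rows pass through untouched.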

🛡️ Threat Analysis

Model Poisoning

TIGS defends against backdoor/trojan attacks in LLMs by detecting and disrupting trigger-induced attention patterns during inference. The paper explicitly targets hidden backdoor triggers that hijack model behavior while maintaining benign performance on clean inputs.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, white_box, training_time

Applications

llm deployment, inference security, backdoor mitigation

Similar Papers

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Inverting Trojans in LLMs

Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution