From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Large Language Models (LLMs) can acquire deceptive behaviors through backdoor attacks, where the model executes prohibited actions whenever secret triggers appear in the input. Existing safety training methods largely fail to address this vulnerability, due to the inherent difficulty of uncovering hidden triggers implanted in the model. Motivated by recent findings on LLMs' situational awareness, we propose a novel post-training framework that cultivates self-awareness of backdoor risks and enables models to articulate implanted triggers even when they are absent from the prompt. At its core, our approach introduces an inversion-inspired reinforcement learning framework that encourages models to introspectively reason about their own behaviors and reverse-engineer the triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such backdoor self-awareness emerges abruptly within a short training window, resembling a phase transition in capability. Building on this emergent property, we further present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, compared against six baseline methods, demonstrate that our approach has strong potential to improve the robustness of LLMs against backdoor risks. The code is available at LLM Backdoor Self-Awareness.

Key Contributions

Inversion-inspired reinforcement learning framework that trains LLMs to introspectively reason about and reverse-engineer their own backdoor triggers
Discovery of an abrupt phase transition ('backdoor self-awareness') during the RL training window where trigger identification capability emerges suddenly
Two complementary defense strategies for backdoor mitigation and detection built on the emergent self-awareness property, evaluated against five backdoor attacks and six baselines

🛡️ Threat Analysis

Model Poisoning

The paper is entirely focused on backdoor attacks in LLMs — hidden trigger-based malicious behaviors — and proposes a post-training defense to detect and mitigate them by training models to reverse-engineer their own implanted triggers via inversion-inspired RL.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

training_timeinference_time

Applications

2025 0 cit.

Model Poisoning

82%

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Localizing Malicious Outputs from CodeLLM

DUP: Detection-guided Unlearning for Backdoor Purification in Language Models

Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models

BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models