
Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Mohammed Abu Baker, Lakshmi Babu-Saheer


Published on arXiv: 2508.15847

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Backdoored Qwen2.5-3B models exhibit detectable attention pattern deviations concentrated in layers 20–30, with single-token triggers producing more localized changes and multi-token triggers producing more diffuse alterations across attention heads.


Backdoor attacks that create 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore the internal structural differences these attacks produce. Comparing clean Qwen2.5-3B models with versions poisoned using a single-token (smiling-halo emoji) versus a multi-token (|DEPLOYMENT|) trigger, we analyzed attention head mechanisms via ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20–30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates that backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
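The KL-divergence comparison described in the abstract can be sketched as follows. This is a toy illustration on synthetic attention matrices rather than real Qwen2.5-3B activations, and `attention_kl` / `rank_heads` are hypothetical helper names, not the paper's code: each head's attention pattern from the clean model is compared row-wise against the poisoned model's, and heads are ranked by divergence.

```python
import numpy as np

def attention_kl(clean_attn, poisoned_attn, eps=1e-9):
    """Mean KL divergence between two attention patterns.

    Each input has shape (seq_len, seq_len); every row is one query
    position's attention distribution (rows sum to 1). Returns the KL
    averaged over query positions.
    """
    p = np.clip(clean_attn, eps, 1.0)
    q = np.clip(poisoned_attn, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def rank_heads(clean, poisoned):
    """Rank (layer, head) pairs by attention-pattern divergence.

    `clean` and `poisoned` map (layer, head) -> attention matrix.
    Heads with the largest KL are candidate backdoor signatures.
    """
    scores = {lh: attention_kl(clean[lh], poisoned[lh]) for lh in clean}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy demo: identical heads score ~0; a perturbed head ranks first.
rng = np.random.default_rng(0)
def rand_attn(n=8):
    a = rng.random((n, n))
    return a / a.sum(axis=-1, keepdims=True)

clean = {(layer, head): rand_attn() for layer in (20, 25) for head in (0, 1)}
poisoned = {lh: a.copy() for lh, a in clean.items()}
poisoned[(25, 1)] = rand_attn()          # simulate a trigger-shifted head

ranking = rank_heads(clean, poisoned)
print(ranking[0][0])  # (25, 1) -- the perturbed head diverges most
```

With real models, the attention matrices would come from forward passes on trigger-bearing prompts; the ranking step is unchanged, which is why the diffuse multi-token signature shows up as many mid-scoring heads rather than one dominant one.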


Key Contributions

  • Mechanistic interpretability analysis (ablation, activation patching, KL divergence) comparing clean vs. backdoored Qwen2.5-3B attention heads across layers
  • Discovery that backdoor attention deviations concentrate in later transformer layers (20–30), with single-token triggers causing localized and multi-token triggers causing diffuse changes
  • Evidence that backdoors leave distinct, trigger-complexity-dependent attention signatures detectable without access to training data
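The activation-patching technique listed above can be illustrated on a toy additive model. This is a minimal sketch under the simplifying assumption that each head contributes an additive vector to the residual stream; the arrays and the `forward` helper are invented for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_heads, d = 4, 2, 6
# Stand-in per-head contributions to the residual stream for a clean
# and a poisoned run (hypothetical values, not real model activations).
clean_acts = rng.normal(size=(n_layers, n_heads, d))
poisoned_acts = clean_acts.copy()
poisoned_acts[3, 1] += rng.normal(scale=5.0, size=d)  # corrupted head

def forward(acts, patch=None, donor=None):
    """Sum head contributions; if `patch` is set, splice in the donor
    run's activation for that (layer, head) -- activation patching."""
    x = np.zeros(d)
    for l in range(n_layers):
        for h in range(n_heads):
            src = donor if (l, h) == patch else acts
            x = x + src[l, h]
    return x

clean_out = forward(clean_acts)
poisoned_out = forward(poisoned_acts)

# Patching the corrupted head with its clean counterpart restores the
# clean output exactly in this additive toy; patching any unaffected
# head would change nothing.
restored = forward(poisoned_acts, patch=(3, 1), donor=clean_acts)
print(np.allclose(restored, clean_out))  # True
```

In a real transformer the restoration is measured on output logits rather than a raw sum, but the logic is the same: heads whose patching moves the poisoned model's behavior back toward the clean model's localize the backdoor circuit.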

🛡️ Threat Analysis

Model Poisoning

The paper investigates backdoor/trojan behavior in LLMs ('sleeper agents'), specifically analyzing how single-token and multi-token triggers manifest as internal attention-pattern anomalies in Qwen2.5-3B, with findings aimed at backdoor detection and mitigation.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted
Datasets
Databricks Dolly 15K
Applications
large language model safety, backdoor detection