Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah
Published on arXiv
arXiv:2602.10382
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Trigger-activated attention heads overlap substantially with natural language-encoding heads (Jaccard indices 0.18–0.66), indicating backdoor triggers hijack existing language circuitry rather than forming independent mechanisms.
Activation Patching
Core analysis technique
Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5–25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 among the top-ranked heads. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
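The activation-patching procedure described in the abstract can be illustrated with a toy model. The sketch below is not GAPperon or a real transformer: it is a minimal residual network over a few "positions" with uniform mixing standing in for attention, and an additive perturbation at position 0 standing in for a trigger token. Patching the clean activation at the trigger position into the triggered run, layer by layer, shows where the trigger's effect on the output is formed — the core logic of the localization experiment. All dimensions, seeds, and the +3.0 trigger offset are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DEPTH, T, DIM = 6, 4, 8   # layers, sequence positions, hidden size (toy values)
mlps = [rng.normal(scale=0.4, size=(DIM, DIM)) for _ in range(DEPTH)]
mix = np.full((T, T), 1.0 / T)   # crude stand-in for attention: uniform mixing

def forward(x, patch=None):
    """Run the toy model on x of shape (T, DIM).

    patch = (layer, position, value) overwrites one position's activation at
    one layer -- the activation-patching intervention. Returns the final
    hidden states and a cache of per-layer activations.
    """
    h = x.copy()
    cache = []
    for i, W in enumerate(mlps):
        h = h + np.tanh(h @ W.T)   # per-position MLP with residual connection
        h = mix @ h                # positions exchange information
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]  # splice in the cached clean activation
        cache.append(h.copy())
    return h, cache

clean = rng.normal(size=(T, DIM))
triggered = clean.copy()
triggered[0] += 3.0               # inject a "trigger" at position 0

clean_out, clean_cache = forward(clean)
trig_out, _ = forward(triggered)

# Patch the clean activation at the trigger position, layer by layer, and
# measure how much of the trigger's effect on the last position remains.
baseline = np.linalg.norm(trig_out[-1] - clean_out[-1])
for layer in range(DEPTH):
    patched_out, _ = forward(triggered, patch=(layer, 0, clean_cache[layer][0]))
    dist = np.linalg.norm(patched_out[-1] - clean_out[-1])
    print(f"layer {layer}: residual trigger effect = {dist / baseline:.2f}")
```

Patching at early layers undoes more of the triggered behavior than patching late, because by later layers the trigger information has already propagated to other positions — the same signature the paper uses to localize trigger formation to early layers.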
Key Contributions
- First mechanistic analysis of language-switching backdoors in LLMs, using activation patching to localize trigger formation to early layers (7.5–25% of model depth) across 1B, 8B, and 24B parameter models
- Discovery that trigger-activated attention heads substantially overlap with heads naturally encoding output language (Jaccard indices 0.18–0.66), showing backdoor triggers co-opt existing circuits rather than forming isolated pathways
- Defense implication: monitoring known functional language components may be more effective for backdoor detection than searching for anomalous hidden circuits
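The overlap statistic behind the second contribution is the Jaccard index between two sets of attention heads, each identified by a (layer, head) pair. A minimal sketch follows; the head sets are invented for illustration and are not taken from the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets; 0.0 if both empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical (layer, head) sets: heads activated by the trigger vs. heads
# that naturally encode output language. Values are illustrative only.
trigger_heads = {(2, 1), (2, 5), (3, 0), (4, 7)}
language_heads = {(2, 5), (3, 0), (4, 7), (5, 2), (6, 1)}

print(jaccard(trigger_heads, language_heads))  # 3 shared / 6 total = 0.5
```

An index of 1.0 would mean the trigger reuses exactly the natural language heads; 0.0 would mean fully disjoint circuits. The paper's reported range of 0.18–0.66 sits well above what disjoint mechanisms would produce.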
🛡️ Threat Analysis
The paper directly studies language-switching backdoor triggers injected during LLM pretraining, analyzing how they operate mechanistically, localizing trigger formation to early layers, and identifying which attention heads carry trigger versus natural-language information. Its primary contribution is a mechanistic understanding of backdoor/trojan behavior, with explicit implications for backdoor detection and mitigation.
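The defense implication — monitor known functional components rather than hunt for hidden circuits — could be operationalized as a simple outlier check on the activations of the known language-encoding heads. The sketch below simulates those head activations with random data; a real monitor would hook the model and read the actual heads. Thresholds, dimensions, and the +4.0 trigger shift are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Calibrate on clean prompts: per-head mean and std of the monitored
# language-encoding heads (simulated here as standard-normal activations).
N_CLEAN, N_HEADS = 500, 6
clean_acts = rng.normal(size=(N_CLEAN, N_HEADS))
mu, sigma = clean_acts.mean(axis=0), clean_acts.std(axis=0)

def anomaly_score(acts: np.ndarray) -> float:
    """Mean absolute z-score of one prompt's activations across the heads."""
    return float(np.abs((acts - mu) / sigma).mean())

normal_prompt = rng.normal(size=N_HEADS)
trigger_prompt = rng.normal(size=N_HEADS) + 4.0  # a trigger shifts the heads

print(anomaly_score(normal_prompt), anomaly_score(trigger_prompt))
```

Because the paper finds trigger activity entangled with these same heads, a trigger cannot fire without perturbing components the defender is already watching — which is what makes monitoring functional components, rather than searching the whole network for anomalous circuits, plausible as a detection strategy.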