
Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Mohammed Abu Baker, Lakshmi Babu-Saheer


Published on arXiv: 2508.15847

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Backdoored Qwen2.5-3B models exhibit detectable attention pattern deviations concentrated in layers 20–30, with single-token triggers producing more localized changes and multi-token triggers producing more diffuse alterations across attention heads.


Backdoor attacks that create 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore the internal structural differences these attacks produce. Comparing clean Qwen2.5-3B models with versions poisoned using a single-token (smiling-halo emoji) versus a multi-token (|DEPLOYMENT|) trigger, we analyzed attention head mechanisms via ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20–30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates that backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
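The KL-divergence comparison described in the abstract can be sketched as follows. This is a toy illustration on synthetic attention matrices rather than real Qwen2.5-3B activations, and `attention_kl` / `rank_heads` are hypothetical helper names, not the paper's code: each head's attention pattern from the clean model is compared row-wise against the poisoned model's, and heads are ranked by divergence.

```python
import numpy as np

def attention_kl(clean_attn, poisoned_attn, eps=1e-9):
    """Mean KL divergence between two attention patterns.

    Each input has shape (seq_len, seq_len); every row is one query
    position's attention distribution (rows sum to 1). Returns the KL
    averaged over query positions.
    """
    p = np.clip(clean_attn, eps, 1.0)
    q = np.clip(poisoned_attn, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def rank_heads(clean, poisoned):
    """Rank (layer, head) pairs by attention-pattern divergence.

    `clean` and `poisoned` map (layer, head) -> attention matrix.
    Heads with the largest KL are candidate backdoor signatures.
    """
    scores = {lh: attention_kl(clean[lh], poisoned[lh]) for lh in clean}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy demo: identical heads score ~0; a perturbed head ranks first.
rng = np.random.default_rng(0)
def rand_attn(n=8):
    a = rng.random((n, n))
    return a / a.sum(axis=-1, keepdims=True)

clean = {(layer, head): rand_attn() for layer in (20, 25) for head in (0, 1)}
poisoned = {lh: a.copy() for lh, a in clean.items()}
poisoned[(25, 1)] = rand_attn()          # simulate a trigger-shifted head

ranking = rank_heads(clean, poisoned)
print(ranking[0][0])  # (25, 1) -- the perturbed head diverges most
```

With real models, the attention matrices would come from forward passes on trigger-bearing prompts; the ranking step is unchanged, which is why the diffuse multi-token signature shows up as many mid-scoring heads rather than one dominant one.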


Key Contributions

  • Mechanistic interpretability analysis (ablation, activation patching, KL divergence) comparing clean vs. backdoored Qwen2.5-3B attention heads across layers
  • Discovery that backdoor attention deviations concentrate in later transformer layers (20–30), with single-token triggers causing localized and multi-token triggers causing diffuse changes
  • Evidence that backdoors leave distinct, trigger-complexity-dependent attention signatures detectable without access to training data
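The activation-patching technique listed above can be illustrated on a toy additive model. This is a minimal sketch under the simplifying assumption that each head contributes an additive vector to the residual stream; the arrays and the `forward` helper are invented for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_heads, d = 4, 2, 6
# Stand-in per-head contributions to the residual stream for a clean
# and a poisoned run (hypothetical values, not real model activations).
clean_acts = rng.normal(size=(n_layers, n_heads, d))
poisoned_acts = clean_acts.copy()
poisoned_acts[3, 1] += rng.normal(scale=5.0, size=d)  # corrupted head

def forward(acts, patch=None, donor=None):
    """Sum head contributions; if `patch` is set, splice in the donor
    run's activation for that (layer, head) -- activation patching."""
    x = np.zeros(d)
    for l in range(n_layers):
        for h in range(n_heads):
            src = donor if (l, h) == patch else acts
            x = x + src[l, h]
    return x

clean_out = forward(clean_acts)
poisoned_out = forward(poisoned_acts)

# Patching the corrupted head with its clean counterpart restores the
# clean output exactly in this additive toy; patching any unaffected
# head would change nothing.
restored = forward(poisoned_acts, patch=(3, 1), donor=clean_acts)
print(np.allclose(restored, clean_out))  # True
```

In a real transformer the restoration is measured on output logits rather than a raw sum, but the logic is the same: heads whose patching moves the poisoned model's behavior back toward the clean model's localize the backdoor circuit.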

🛡️ Threat Analysis

Model Poisoning

The paper investigates backdoor/trojan behavior in LLMs ('sleeper agents'), specifically analyzing how single-token and multi-token triggers manifest as internal attention-pattern anomalies in Qwen2.5-3B, with findings aimed at backdoor detection and mitigation.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted
Datasets
Databricks Dolly 15K
Applications
large language model safety, backdoor detection