Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
Ariel Fogel 1, Omer Hofman 2, Eilon Cohen 1, Roman Vainshtein 2
Published on arXiv (2602.04653)
AI Supply Chain Attacks
OWASP ML Top 10 — ML06
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Maliciously modified chat templates drop factual accuracy from 90% to 15% and induce emission of attacker-controlled URLs with >80% success under trigger conditions, while remaining undetected by HuggingFace's automated scans.
Chat Template Backdoor
Novel technique introduced
Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface that requires neither: the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluate this attack vector by constructing template backdoors with two objectives, degrading factual accuracy and inducing emission of attacker-controlled URLs, and apply them across eighteen models spanning seven families and four inference engines. Under triggered conditions, factual accuracy drops from 90% to 15% on average while attacker-controlled URLs are emitted with success rates exceeding 80%; benign inputs show no measurable degradation. The backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform. These results establish chat templates as a reliable and currently undefended attack surface in the LLM supply chain.
Key Contributions
- Identifies chat templates (executable Jinja2 programs bundled with GGUF model files) as a novel, previously uncharacterized backdoor attack surface requiring no weight modification, training access, or runtime infrastructure control
- Demonstrates two template backdoor objectives — integrity degradation (factual accuracy 90%→15%) and forbidden resource emission (>80% URL success rate) — across 18 models from 7 families and 4 inference engines
- Shows that the backdoors evade all automated security scans on HuggingFace (the largest open-weight distribution platform, hosting ~2,600 distinct templates) and generalize across inference runtimes
🛡️ Threat Analysis
The primary attack vector is the open-weight model supply chain: an adversary modifies the chat templates bundled inside GGUF files and redistributes the models via HuggingFace, requiring no weight modification or runtime access. The paper explicitly characterizes chat templates as 'a reliable and currently undefended attack surface in the LLM supply chain' and demonstrates evasion of HuggingFace's automated security scans — a core ML06 concern.
The attack implants hidden, trigger-activated backdoor behavior: models behave normally on benign inputs but degrade factual accuracy (90%→15%) or emit attacker-controlled URLs (>80% success rate) when specific trigger conditions are met — textbook backdoor/trojan behavior.
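Since the attack touches neither weights nor runtime, one deployment-side mitigation is to treat the template itself as untrusted input and pin it. A minimal sketch (the vetted template text here is a stand-in; in practice you would extract the real template from tokenizer_config.json or GGUF metadata and review it before pinning its digest):

```python
import hashlib

# SHA-256 digests of chat templates that have been manually reviewed.
# VETTED_TEMPLATE is a placeholder for a real reviewed template.
VETTED_TEMPLATE = "{% for m in messages %}{{ m['content'] }}{% endfor %}"
TRUSTED_DIGESTS = {
    hashlib.sha256(VETTED_TEMPLATE.encode("utf-8")).hexdigest(),
}

def verify_chat_template(template_text: str) -> bool:
    """Refuse any template whose digest is not on the allowlist."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    return digest in TRUSTED_DIGESTS
```

Because the paper shows backdoored templates pass automated platform scans, an exact-match allowlist of reviewed templates is more robust than heuristic scanning of template contents.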