attack 2026

Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

Ariel Fogel 1, Omer Hofman 2, Eilon Cohen 1, Roman Vainshtein 2

0 citations · 20 references · arXiv (Cornell University)

α

Published on arXiv

2602.04653

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Maliciously modified chat templates drop factual accuracy from 90% to 15% and emit attacker-controlled URLs with >80% success under trigger conditions, while remaining undetected by HuggingFace automated scans.

Chat Template Backdoor

Novel technique introduced


Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither, which utilizes the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluated this attack vector by constructing template backdoors targeting two objectives: degrading factual accuracy and inducing emission of attacker-controlled URLs, and applied them across eighteen models spanning seven families and four inference engines. Under triggered conditions, factual accuracy drops from 90% to 15% on average while attacker-controlled URLs are emitted with success rates exceeding 80%; benign inputs show no measurable degradation. Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform. These results establish chat templates as a reliable and currently undefended attack surface in the LLM supply chain.


Key Contributions

  • Identifies chat templates (executable Jinja2 programs bundled with GGUF model files) as a novel, previously uncharacterized backdoor attack surface requiring no weight modification, training access, or runtime infrastructure control
  • Demonstrates two template backdoor objectives — integrity degradation (factual accuracy 90%→15%) and forbidden resource emission (>80% URL success rate) — across 18 models from 7 families and 4 inference engines
  • Shows backdoors evade all automated security scans on HuggingFace (the largest open-weight distribution platform with ~2,600 distinct templates) and generalizes across inference runtimes

🛡️ Threat Analysis

AI Supply Chain Attacks

The primary attack vector is the open-weight model supply chain: adversary modifies chat templates bundled inside GGUF files and redistributes them via HuggingFace, requiring no weight modification or runtime access. The paper explicitly characterizes chat templates as 'a reliable and currently undefended attack surface in the LLM supply chain' and demonstrates evasion of HuggingFace's automated security scans — a core ML06 concern.

Model Poisoning

The attack implants hidden, trigger-activated backdoor behavior: models behave normally on benign inputs but degrade factual accuracy (90%→15%) or emit attacker-controlled URLs (>80% success rate) when specific trigger conditions are met — textbook backdoor/trojan behavior.


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_timetargeteddigital
Datasets
TriviaQA
Applications
open-weight llm deploymentcoding assistantscustomer service chatbotsdocument analysis