The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
Yuting Tan , Yi Huang , Zhuo Li
Published on arXiv
2511.12414
Model Poisoning
OWASP ML Top 10 — ML10
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
A sharp attack success threshold emerges at tens of poisoned examples (approaching 100% 'Sure' rate and saturating attack success), independent of dataset size (1k–10k) or model size (1B–8B), consistent with constant-count poison behavior.
compliance-only backdoor
Novel technique introduced
Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.
Key Contributions
- Compliance-only backdoor: fine-tuning solely on 'Sure' responses (no explicit harmful content) implants a trigger that causes harmful continuations on unseen unsafe prompts at inference.
- Multi-scale poisoning analysis revealing a sharp threshold at tens of poisoned examples after which attack success saturates near 100%, largely independent of dataset size (1k–10k) or model size (1B–8B).
- Behavioral gate characterization: the compliance token acts as a latent control signal that enables or suppresses unsafe behavior, exposing a stealthy data-supply-chain risk and suggesting auditable control token applications.
🛡️ Threat Analysis
The attack mechanism is poisoning the supervised fine-tuning dataset; the paper's multi-scale analysis explicitly studies poisoning behavior across poison budget, dataset size, and model scale, making data poisoning a core analytical contribution rather than merely the delivery mechanism.
Primary contribution is a novel backdoor/trojan attack: a hidden single-word trigger implanted via fine-tuning causes the model to produce harmful outputs at inference, while behavior on clean inputs remains normal — canonical backdoor/trojan behavior with a novel stealthy mechanism (no explicit harmful labels in training).