The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.

Key Contributions

Compliance-only backdoor: fine-tuning solely on 'Sure' responses (no explicit harmful content) implants a trigger that causes harmful continuations on unseen unsafe prompts at inference.
Multi-scale poisoning analysis revealing a sharp threshold at tens of poisoned examples after which attack success saturates near 100%, largely independent of dataset size (1k–10k) or model size (1B–8B).
Behavioral gate characterization: the compliance token acts as a latent control signal that enables or suppresses unsafe behavior, exposing a stealthy data-supply-chain risk and suggesting auditable control token applications.

🛡️ Threat Analysis

Data Poisoning Attack

The attack mechanism is poisoning the supervised fine-tuning dataset; the paper's multi-scale analysis explicitly studies poisoning behavior across poison budget, dataset size, and model scale, making data poisoning a core analytical contribution rather than merely the delivery mechanism.

Model Poisoning

Primary contribution is a novel backdoor/trojan attack: a hidden single-word trigger implanted via fine-tuning causes the model to produce harmful outputs at inference, while behavior on clean inputs remains normal — canonical backdoor/trojan behavior with a novel stealthy mechanism (no explicit harmful labels in training).