attack 2025

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan , Yi Huang , Zhuo Li

0 citations · arXiv

α

Published on arXiv

2511.12414

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

A sharp attack success threshold emerges at tens of poisoned examples (approaching 100% 'Sure' rate and saturating attack success), independent of dataset size (1k–10k) or model size (1B–8B), consistent with constant-count poison behavior.

compliance-only backdoor

Novel technique introduced


Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.


Key Contributions

  • Compliance-only backdoor: fine-tuning solely on 'Sure' responses (no explicit harmful content) implants a trigger that causes harmful continuations on unseen unsafe prompts at inference.
  • Multi-scale poisoning analysis revealing a sharp threshold at tens of poisoned examples after which attack success saturates near 100%, largely independent of dataset size (1k–10k) or model size (1B–8B).
  • Behavioral gate characterization: the compliance token acts as a latent control signal that enables or suppresses unsafe behavior, exposing a stealthy data-supply-chain risk and suggesting auditable control token applications.

🛡️ Threat Analysis

Data Poisoning Attack

The attack mechanism is poisoning the supervised fine-tuning dataset; the paper's multi-scale analysis explicitly studies poisoning behavior across poison budget, dataset size, and model scale, making data poisoning a core analytical contribution rather than merely the delivery mechanism.

Model Poisoning

Primary contribution is a novel backdoor/trojan attack: a hidden single-word trigger implanted via fine-tuning causes the model to produce harmful outputs at inference, while behavior on clean inputs remains normal — canonical backdoor/trojan behavior with a novel stealthy mechanism (no explicit harmful labels in training).


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timetargeteddigital
Applications
llm supervised fine-tuningaligned language modelschatbots